Abstract


Planning  is  an  essential  part  of  intelligent  behavior  and  a  ubiquitous  task  for  both  humans  and 
rational agents. One framework for planning in the presence of uncertainty is probabilistic planning,
in which actions are described by a probability distribution over their possible outcomes.
Probabilistic  planning  has  been  applied  to  different  real-world  scenarios  such  as  public  health, 
sustainability  and  robotics;  however,  the  usage  of  probabilistic  planning  in  practice  is  limited  due 
to  the  poor  performance  of  existing  planners. 

In this thesis, we introduce a novel approach to effectively solve probabilistic planning problems
by relaxing them into short-sighted problems. A short-sighted problem is a relaxed problem
in which the state space of the original problem is pruned and artificial goals are added to
heuristically estimate the cost of reaching an original goal from the pruned states. Unlike
previously proposed relaxations, short-sighted problems maintain the original structure of actions
and impose no restriction on the maximum number of actions that can be executed.
Therefore, the solutions for short-sighted problems take into consideration all the probabilistic
outcomes of actions and their probabilities. In this thesis, we also study different criteria to
generate short-sighted problems, i.e., how to prune the state space, and the relation between the
obtained short-sighted models and previously proposed relaxation approaches.

We present different planning algorithms that use short-sighted problems in order to solve
probabilistic planning problems. These algorithms iteratively generate and execute optimal policies
for short-sighted problems until the goal of the original problem is reached. We also formally
analyze the introduced algorithms, focusing on their optimality guarantees with respect to
the original probabilistic problem. Finally, this thesis contributes a rich empirical comparison
between our algorithms and state-of-the-art probabilistic planners.
Chapter 1
Introduction 


Planning is an essential part of intelligent behavior and a ubiquitous task for both humans and
rational agents [Newell and Simon, 1963]. One framework for planning is probabilistic planning,
in which actions are described by a probability distribution over their possible outcomes.
Solutions to a probabilistic planning problem are policies, i.e., mappings from states to actions.

In order to illustrate the trade-offs between different types of policies, consider the probabilistic
problem of an agent navigating an environment to reach a goal location with two possible
paths:  (i)  a  maze;  and  (ii)  a  hallway  with  locked  doors.  The  agent  has  all  the  necessary  keys  to 
open  the  doors  in  the  hallway;  however,  assume  that  with  non-zero  probability,  the  key  jams  in 
the  door  lock,  resulting  in  a  door  that  cannot  be  unlocked.  Figure  1.1  illustrates  this  probabilistic 
planning  problem. 


Figure  1.1:  Example  of  a  probabilistic  planning  problem.  The  agent  has  to  reach  the  goal  location 
from  the  initial  location.  The  two  doors  in  the  hallway  (top)  are  locked  and,  with  non-zero 
probability,  the  key  jams  in  the  doors.  A  jammed  key  cannot  open  a  door. 

One possible solution for probabilistic planning problems is a policy that maps every state
of the problem to an action. Solutions of this class, i.e., closed policies, are extremely powerful
because they encompass all the probabilistically reachable states in the environment. Therefore,
a closed policy for the example in Figure 1.1 encompasses the cases in which: the keys do
not jam; the key jams in the first door; the key opens the first door and jams in the second; and
the complete solution of the maze. Suppose the probability of the key jamming is 0.01; then
the probability of not reaching the goal through the hallway is $1 - 0.99^2 = 0.0199$. Thus, with
probability 0.9801, the possibly large computational effort to find the maze's solution is wasted,
since the maze would not be explored.
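The arithmetic in this example can be checked with a short sketch. It assumes, as the calculation for two doors implicitly does, that the key jams independently at each door; the function name is hypothetical and introduced only for illustration.

```python
def hallway_failure_probability(p_jam: float, doors: int) -> float:
    """Probability that the key jams in at least one of `doors` locks,
    assuming independent jams with probability `p_jam` each."""
    return 1.0 - (1.0 - p_jam) ** doors


p_fail = hallway_failure_probability(0.01, 2)
print(f"hallway fails with probability    {p_fail:.4f}")
print(f"hallway succeeds with probability {1.0 - p_fail:.4f}")
```

Under the same independence assumption, the failure probability grows with the number of doors, which is what makes the maze worth considering when jams are likely.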

The  second  class  of  possible  solutions  is  a  policy  that  maps  only  a  subset  of  states  to  actions. 
Such policies, i.e., partial policies, do not address all possible probabilistically reachable states in
the  environment.  Therefore,  a  state  not  predicted  by  the  partial  policy  might  be  reached  and, 
when  and  if  such  a  state  is  reached,  a  new  partial  policy  has  to  be  computed  and  executed.  In  the 
example  of  Figure  1.1,  a  possible  partial  policy  is  to  reach  the  goal  through  the  hallway  and  not 
consider  the  case  in  which  a  key  jams.  If  a  key  jams,  then  a  new  partial  policy  in  which  the  agent 
backtracks  and  solves  the  maze  is  returned.  Note  that  this  partial  policy  ignores  the  size  of  the 
maze,  and  it  would  be  executed  even  if  the  probability  of  jamming  the  key  is  high  or  if  the  maze 
is  small. 

Algorithms to solve probabilistic planning problems can be classified according to the type
of policy they return: probabilistic planners, e.g., [Barto et al., 1995], compute (optimal)
closed policies; and replanners, e.g., [Yoon et al., 2007], return partial policies. Since probabilistic
planners must consider all the probabilistically reachable states in order to compute a closed
policy, their scalability is limited to small problems. Alternatively, replanners compute partial
policies based on simplifications of the original problem and are able to scale up to large problems.
A common simplification applied by replanners is to relax the probabilistic actions into
deterministic actions [Yoon et al., 2007]. This action relaxation results in algorithms that are
oblivious to probabilities. Therefore, replanners based on action simplification obtain good
performance in some domains but poor performance in probabilistically interesting problems [Little and
Thiebaux, 2007], i.e., problems in which probabilities cannot be ignored.

This thesis introduces a novel approach to solve probabilistic planning problems by relaxing
them into short-sighted problems. A short-sighted problem is a relaxation in which the state
space of the original problem is pruned and artificial goals are added to heuristically estimate the
cost of reaching an original goal from the pruned states. Figure 1.2 shows an example of a
short-sighted problem for the probabilistic planning problem depicted in Figure 1.1. This short-sighted
problem example perfectly represents the hallway path and prunes the maze path due to its large
size. The locations $A_1, \dots, A_4$ represent artificial goals, i.e., non-goal locations of the original
problem which are goals for the short-sighted problem.

Unlike previously proposed relaxations, short-sighted problems maintain the original
structure of actions and impose no restriction on the maximum number of actions that
can be executed. For instance, the probabilistic action to open a door is the same in both the
original problem of our example (Figure 1.1) and its short-sighted example (Figure 1.2). Therefore,
the solutions for short-sighted problems take into consideration all the probabilistic outcomes of
actions and their probabilities. In this thesis, we study different criteria to generate short-sighted
problems, i.e., how to prune the state space, and the relation between the obtained short-sighted
models and previously proposed relaxation approaches.

Figure 1.2: Example of short-sighted problem for the probabilistic planning problem in Figure 1.1.
$A_1$ to $A_4$ represent artificial goals. A heuristic is used to estimate the cost of solving the
maze from each location $A_i$.

Another  important  aspect  of  short-sighted  problems  is  the  guidance  towards  the  original  goals 
offered by the artificial goals. For instance, in the short-sighted problem in Figure 1.2, the sum of
Manhattan distances can be used as a heuristic to estimate the cost of solving the original problem
starting from each artificial goal $A_i$. Therefore, an optimal closed policy for this short-sighted
problem  is  able  to  heuristically  approximate  the  trade-off  between  the  two  different  paths  in  our 
running  example:  if  the  probability  of  a  key  jamming  is  small,  the  hallway  path  is  preferred  since 
it  is  the  shortest  path;  alternatively,  if  the  jamming  probability  is  large,  then  solving  the  (large) 
maze  is  chosen  due  to  the  low  probability  of  successfully  opening  both  doors  in  the  hallway  path. 
Since  short-sighted  problems  are  small  with  respect  to  the  original  problem,  the  computation  of 
an  optimal  closed  policy  for  them  is  feasible. 

This  thesis  then  introduces  different  planning  algorithms  that  use  short-sighted  problems  and 
their  optimal  closed  policies  in  order  to  solve  probabilistic  planning  problems.  These  algorithms 
consist  in  iteratively  generating  and  executing  a  closed  policy  for  short-sighted  problems  until  the 
goal  state  of  the  original  problem  is  reached.  Different  methods  of  combining  the  solutions  from 
short-sighted  problems  are  studied,  including  sequential  and  parallel  approaches.  We  formally 
analyze  the  introduced  algorithms,  focusing  on  their  optimality  guarantees  with  respect  to  the 
original  probabilistic  problem.  Finally,  this  thesis  also  contributes  a  rich  empirical  comparison 
between  the  proposed  algorithms  and  state-of-the-art  probabilistic  planners  and  replanners. 
Figure 1.3: Overview of the thesis approach. (a) Representation of the state space with one
short-sighted problem and (b) a sequence of short-sighted problems. The initial state of the problem is
represented  by  the  blue  dot,  the  goal  states  are  represented  by  the  green  star.  Each  short-sighted 
problem  is  depicted  as  a  cloud.  States  in  the  border  of  the  cloud  are  artificial  goals  and  the  color 
gradient  in  the  cloud  contour  represents  the  heuristic  cost  to  reach  a  goal  state:  darker  regions  are 
more  costly  than  lighter  regions.  The  red  line  represents  the  states  visited  during  the  execution 
of  a  closed  policy  of  the  respective  short-sighted  problem. 

1.1  Thesis  Question  and  Approach 

This  thesis  seeks  to  answer  the  question, 

How to plan for probabilistic environments such that it scales up while offering
formal guarantees underlying the policy generation?

We  answer  this  question  by  introducing  new  models  to  represent  subproblems  of  probabilistic 
planning problems, developing new algorithms to exploit the proposed subproblems and analyzing,
both theoretically and empirically, the proposed algorithms.

Precisely, we introduce different models to represent short-sighted problems, i.e., subproblems
of the original problem with pruned state space and artificial goals to heuristically guide the
search towards the original goals. Figure 1.3(a) depicts the state space of a probabilistic planning
problem and the state space of a short-sighted problem. Each short-sighted model defines a
criterion  to  prune  the  state  space  of  the  original  problem  and,  pictorially,  a  short-sighted  model 
governs  the  shape  of  the  clouds  in  Figure  1.3.  We  formally  show  the  relationship  between  the 
optimal  solutions  for  short-sighted  models  and  probabilistic  planning  problems,  e.g.,  the  former 
is  a  lower  bound  for  the  latter. 

Based on the general definition of short-sighted problems, we design algorithms that iteratively
generate and solve short-sighted problems of the original probabilistic planning problem.
Due to the reduced size of the short-sighted problems, an optimal closed policy can be computed
and these policies are combined in order to obtain a solution to the original probabilistic
planning problem. Figure 1.3(b) depicts this process in which a closed policy is computed for
a short-sighted problem (cloud) and executed (red line) until a goal of the original problem is
reached; if an artificial goal is reached (point in the cloud's border), then this process is repeated
using the reached artificial goal as the new initial state.
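The generate, solve and execute loop described above can be illustrated with a heavily simplified sketch. It is not the SSiPP algorithm as defined in later chapters: the chain-shaped problem, the unit costs, the depth-based pruning rule, the heuristic and all function names are assumptions made only for this illustration.

```python
import random
from collections import deque

GOAL = 5                      # hypothetical chain of locations 0..5
ACTIONS = ("left", "right")

def transitions(s, a):
    """Return [(next_state, probability)]; every action costs 1 (assumption)."""
    if a == "right":          # advancing fails with probability 0.2
        return [(min(s + 1, GOAL), 0.8), (s, 0.2)]
    return [(max(s - 1, 0), 1.0)]

def heuristic(s):
    """Lower bound: at least GOAL - s unit-cost actions are needed."""
    return float(GOAL - s)

def short_sighted_problem(s0, depth):
    """States within `depth` actions of s0 and the artificial goals on the border."""
    dist, queue = {s0: 0}, deque([s0])
    while queue:
        s = queue.popleft()
        if s == GOAL or dist[s] == depth:
            continue          # goals and border states are not expanded
        for a in ACTIONS:
            for s2, _ in transitions(s, a):
                if s2 not in dist:
                    dist[s2] = dist[s] + 1
                    queue.append(s2)
    artificial = {s for s, d in dist.items() if d == depth and s != GOAL}
    return set(dist), artificial

def solve(states, artificial, sweeps=200):
    """Value iteration on the subproblem; artificial goals keep their heuristic value."""
    V = {s: heuristic(s) for s in states}
    policy = {}
    for _ in range(sweeps):
        for s in states:
            if s == GOAL or s in artificial:
                continue
            q = {a: sum(p * (1.0 + V[s2]) for s2, p in transitions(s, a))
                 for a in ACTIONS}
            policy[s] = min(q, key=q.get)
            V[s] = q[policy[s]]
    return policy

def run(s0=0, depth=2, seed=0):
    rng = random.Random(seed)
    s, steps = s0, 0
    while s != GOAL:          # one iteration per short-sighted problem
        states, artificial = short_sighted_problem(s, depth)
        policy = solve(states, artificial)
        while s != GOAL and s not in artificial:   # execute the closed policy
            successors = transitions(s, policy[s])
            s = rng.choices([s2 for s2, _ in successors],
                            weights=[p for _, p in successors])[0]
            steps += 1
    return s, steps

final_state, steps = run()
print(final_state, steps)
```

In this toy problem each subproblem contains at most five states, so solving it exactly is cheap; the heuristic value attached to the artificial goals is what steers the agent towards the original goal.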

We also prove the theoretical properties of the introduced algorithms, e.g., guarantee to always
reach an original goal state and convergence to the optimal solution of the original problem.
Finally,  we  empirically  compare  the  proposed  algorithms  and  short-sighted  models  to  understand 
the  different  trade-offs  between  them. 


1.2  Contributions 

The  key  contributions  of  this  thesis  are: 

•  Depth-based,  Trajectory-based  and  Greedy  Short-Sighted  Probabilistic  Problems. 

We  introduce  three  different  short-sighted  models  based  on  different  criteria  to  prune  the 
state  space:  depth-based  short-sighted  problems,  in  which  all  the  states  are  reachable  using 
no  more  than  a  given  number  of  actions;  trajectory-based  short-sighted  problems,  in  which 
all states are reachable with probability greater than or equal to a given threshold; and greedy
short-sighted  problems,  in  which  the  states  have  the  best  trade-off  between  probability  of 
being  reached  and  expected  cost  to  reach  the  goal  from  them. 

•  Short-Sighted  Probabilistic  Planner  and  extensions.  We  introduce  the  Short-Sighted 
Probabilistic  Planner  (SSiPP)  algorithm  that  solves  probabilistic  planning  problems  using 
short-sighted  problems.  We  extend  SSiPP  in  three  different  directions:  Labeled  SSiPP, 
which improves the convergence of SSiPP to the optimal solution; SSiPP-FF, which improves
the efficiency of SSiPP for generating suboptimal solutions; and Parallel Labeled
SSiPP, which solves multiple short-sighted problems in parallel to speed up the search for
the  optimal  solution. 

• Theoretical and Empirical Analysis. We prove the theoretical properties of our algorithms,
e.g., termination (i.e., always reach a goal state) and optimality. We also provide a
comprehensive empirical evaluation of the proposed algorithms under different scenarios:
(i) finding the optimal solution; (ii) finding a solution with limited time to compute the
next action to be executed; and (iii) finding a solution under the International Probabilistic
Planning Competition [Younes et al., 2005, Bonet and Givan, 2007, Bryce and Buffet,
2008] rules.
1.3 Guide to the Thesis

Here  we  outline  the  chapters  that  follow. 

•  Chapter  2  -  Background.  We  review  the  basics  for  Stochastic  Shortest  Path  problems 
(SSPs),  our  chosen  model  to  represent  probabilistic  planning  problems.  We  also  review  the 
following  algorithms  necessary  for  the  next  chapters:  Real-Time  Dynamic  Programming 
[Barto  et  al.,  1995]  and  FF-Replan  [Yoon  et  al.,  2007]. 

• Chapter 3 - Short-Sighted Probabilistic Planning. We present depth-based short-sighted
Stochastic Shortest Path problems, a novel model to represent subproblems of SSPs. We
also introduce the Short-Sighted Probabilistic Planner (SSiPP) algorithm using the depth-based
short-sighted SSPs as the model for the subproblems generated by SSiPP. We prove the
relations between the solutions of SSPs and their depth-based short-sighted SSPs and that
SSiPP is optimal. We conclude by showing the effectiveness of SSiPP using depth-based
short-sighted SSPs in a proposed series of problems.

• Chapter 4 - General Short-Sighted Models. This chapter extends the concept of depth-based
short-sighted SSPs to a general model in which a function to prune the state space
is given. Using this general formulation, we introduce two new models for short-sighted
problems: trajectory-based short-sighted SSPs and greedy short-sighted SSPs.

• Chapter 5 - Extending Short-Sighted Probabilistic Planner. We present three extensions
of SSiPP: Labeled SSiPP, SSiPP-FF and Parallel Labeled SSiPP. We also present the
theoretical  guarantees  of  each  of  these  algorithms  and  demonstrate  their  effectiveness  in 
different  proposed  domains. 

•  Chapter  6  -  Related  Work.  We  discuss  the  previous  work  in  optimal  and  suboptimal 
probabilistic  planning,  and  how  they  relate  to  this  thesis. 

• Chapter 7 - Empirical Evaluation. This chapter presents an extensive empirical evaluation
of the proposed probabilistic planners against the state-of-the-art probabilistic planners.

• Chapter 8 - A Real World Application: a Service Robot Searching for Objects. We
show how the problem of an autonomous agent moving in a known environment to find
objects, while minimizing the search cost, can be solved by using short-sighted probabilistic
planning. As a concrete example, we use the problem of a mobile service robot that
moves  in  a  building  to  find  an  object,  whose  location  is  not  deterministically  known,  and 
to  deliver  it  to  a  location. 

• Chapter 9 - Conclusion. We conclude this dissertation with a summary of our contributions
along with a discussion of future work for short-sighted planning.

Figure 1.4 illustrates the chapters' organization and the dependency between chapters of this
dissertation. All readers should begin with Chapter 2, which provides the necessary mathematical
background and defines the notation used in this dissertation.

Figure 1.4: Organization of the chapters in this thesis.




Chapter  2 
Background 


This chapter introduces Stochastic Shortest Path Problems, the probabilistic planning model
used in this dissertation. We begin with a basic overview (Section 2.1) that follows the presentation
in [Bertsekas, 1995] with small differences in notation to match the planning community
notation. In Section 2.2, we present the high-level representation of probabilistic planning problems
proposed by the planning community. Finally, Section 2.3 reviews two standard algorithms
for probabilistic planning, Real-Time Dynamic Programming and FF-Replan, that are frequently
referred to in this dissertation.


2.1  Stochastic  Shortest  Path  Problem 

A Stochastic Shortest Path Problem (SSP) [Bertsekas and Tsitsiklis, 1991] is a tuple
$\mathbb{S} = \langle S, s_0, G, A, P, C \rangle$, in which:

• $S$ is the finite set of states;

• $s_0 \in S$ is the initial state;

• $G \subseteq S$ is the non-empty set of goal states;

• $A$ is the finite set of actions;

• $P(s'|s, a)$ represents the probability that $s' \in S$ is reached after applying action $a \in A$ in
state $s \in S$; and

• $C(s, a, s') \in (0, +\infty)$ is the immediate cost incurred when state $s'$ is reached after applying
action $a$ in state $s$. This function is required to be defined for all $s, a, s'$ in which
$P(s'|s, a) > 0$.
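As a minimal sketch of how this tuple could be held in memory, the following container stores the six components and checks the constraints stated in the definition. The class name, the dictionary layout and the three-state example are assumptions made for illustration only; pairs (s, a) missing from `P` are treated as inapplicable actions.

```python
from dataclasses import dataclass


@dataclass
class SSP:
    states: frozenset   # S
    s0: str             # initial state
    goals: frozenset    # G
    actions: frozenset  # A
    P: dict             # (s, a) -> {s': probability}
    C: dict             # (s, a, s') -> immediate cost

    def validate(self) -> None:
        assert self.s0 in self.states
        assert self.goals and self.goals <= self.states   # non-empty subset of S
        for (s, a), dist in self.P.items():
            assert s in self.states and a in self.actions
            assert abs(sum(dist.values()) - 1.0) < 1e-9   # P(.|s, a) is a distribution
            for s2, p in dist.items():
                if p > 0:
                    assert s2 in self.states
                    assert self.C[(s, a, s2)] > 0         # strictly positive cost


# Hypothetical three-state example.
ssp = SSP(
    states=frozenset({"s0", "s1", "g"}),
    s0="s0",
    goals=frozenset({"g"}),
    actions=frozenset({"a0", "a1"}),
    P={
        ("s0", "a0"): {"s1": 1.0},
        ("s0", "a1"): {"g": 0.5, "s0": 0.5},
        ("s1", "a0"): {"g": 1.0},
    },
    C={
        ("s0", "a0", "s1"): 2.0,
        ("s0", "a1", "g"): 1.0,
        ("s0", "a1", "s0"): 1.0,
        ("s1", "a0", "g"): 1.0,
    },
)
ssp.validate()
```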

In SSPs, an agent executes actions $a \in A$ in discrete time steps, at a state $s \in S$. The
chosen action $a$ changes state $s$ to state $s'$ with probability $P(s'|s, a)$ and the cost $C(s, a, s')$ is
incurred. If a goal state $s_G \in G$ is reached, the problem finishes, i.e., no more actions need to be
executed. The sequence of states $T = \langle s_0, s_1, s_2, \dots \rangle$ visited by the agent is called a trajectory
and the state $s_i$ is the state of the environment at time step $i$. Thus, for every trajectory $T$, there
exists at least one sequence of actions $\langle a_0, a_1, a_2, \dots \rangle$ such that $a_i$ is executed in state $s_i$ and
$P(T \mid \langle a_0, a_1, a_2, \dots \rangle) = \prod_{i \in \{0, 1, \dots\}} P(s_{i+1}|s_i, a_i) > 0$.
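The product that defines the probability of a trajectory can be computed directly for a finite trajectory. The sketch below uses a hypothetical transition model stored as a dictionary; the function name and the numbers are illustrative assumptions.

```python
from math import prod

# Hypothetical transition model: (state, action) -> {next_state: probability}
P = {
    ("s0", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a0"): {"s1": 0.75, "g": 0.25},
}


def trajectory_probability(P, states, actions):
    """Product over i of P(s_{i+1} | s_i, a_i) for a finite trajectory."""
    assert len(states) == len(actions) + 1
    return prod(
        P.get((s, a), {}).get(s_next, 0.0)
        for s, a, s_next in zip(states, actions, states[1:])
    )


p = trajectory_probability(P, ["s0", "s1", "g"], ["a0", "a0"])
print(p)
```

A sequence of states that contains a transition with probability zero, such as jumping from `s0` directly to `g` here, gets probability zero and is therefore not a trajectory of the action sequence.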

The horizon is the maximum number of actions the agent is allowed to execute in the environment,
and therefore the maximum size of $T$. For SSPs, the horizon is indefinite since, under
certain conditions discussed later in this section, a goal state can be reached using a finite, yet
unbounded, number of actions. If the horizon is set to $t_{max}$, then the obtained model is known
as a finite-horizon Markov Decision Process (MDP) [Puterman, 1994]. Alternatively, if no goal
states are given, then the horizon becomes infinite since no stop condition is given to the agent.
In order to guarantee that the total accumulated cost is finite in such models, the cost incurred
at time step $t$ is discounted by $\gamma^t$, for $\gamma \in (0, 1)$. The obtained model is known as a discounted
infinite-horizon MDP [Puterman, 1994]. Both finite-horizon and discounted infinite-horizon
MDPs are special cases of SSPs [Bertsekas and Tsitsiklis, 1996].

A solution to an SSP is a policy $\pi$, i.e., a mapping from $S$ to $A$. We denote all the states
reachable from $s_0$ when following $\pi$ as $S^\pi \subseteq S$ and the set of states in which replanning is necessary
as $R^\pi$. Formally, $R^\pi = \{s \in S \setminus G \mid \pi \text{ is not defined for } s\}$. A policy $\pi$ can be classified according
to $S^\pi$ and $R^\pi$. If a policy $\pi$ can be followed from $s_0$ without replanning, i.e., $R^\pi \cap S^\pi = \emptyset$, then
$\pi$ is a closed policy. A special case of closed policies is the complete policies, i.e., policies that
can be followed from any state $s \in S$ without replanning. Thus, for any complete policy $\pi$, we
have that $R^\pi = \emptyset$. If a policy $\pi$ is not closed, then $R^\pi \cap S^\pi \neq \emptyset$ and it is known as a partial
policy. For any partial policy $\pi$, replanning has non-zero probability of happening, since every
state $s \in R^\pi \cap S^\pi$ has non-zero probability of being reached when following $\pi$ from $s_0$.
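The classification above depends only on two sets, the states reachable under the policy and the states where the policy is undefined, so it can be computed by a simple graph search. The following is a sketch under assumptions: the four-state SSP is hypothetical (it is not the SSP of any figure in this document) and the function names are introduced here for illustration.

```python
def reachable_states(P, goals, s0, policy):
    """S^pi: states reachable from s0 when following `policy`."""
    seen, stack = {s0}, [s0]
    while stack:
        s = stack.pop()
        if s in goals or s not in policy:
            continue  # execution stops at goals and where the policy is undefined
        for s2, p in P[(s, policy[s])].items():
            if p > 0 and s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return seen


def classify(P, states, goals, s0, policy):
    S_pi = reachable_states(P, goals, s0, policy)
    R_pi = {s for s in states - goals if s not in policy}
    if not R_pi:
        return "complete"          # special case of a closed policy
    return "closed" if not (R_pi & S_pi) else "partial"


# Hypothetical four-state SSP.
states, goals = {"s0", "s1", "s2", "g"}, {"g"}
P = {
    ("s0", "a0"): {"s1": 1.0},
    ("s0", "a1"): {"s2": 0.5, "g": 0.5},
    ("s1", "a0"): {"g": 1.0},
    ("s2", "a0"): {"g": 1.0},
}
print(classify(P, states, goals, "s0", {"s0": "a0", "s1": "a0"}))
print(classify(P, states, goals, "s0", {"s0": "a1"}))
print(classify(P, states, goals, "s0", {"s0": "a0", "s1": "a0", "s2": "a0"}))
```

The first policy never reaches `s2`, so leaving it undefined there does no harm; the second one can reach `s2` with probability 0.5 and would then require replanning.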

Policies can also be classified according to their termination guarantee. $\pi$ is a proper policy
if it is inevitable to reach a goal state when following the policy $\pi$ from $s_0$. Formally:

Definition 2.1 (Proper policy). A policy $\pi$ is proper if, for all $s \in S^\pi$, there exists a trajectory
$T = \langle s, s_1, \dots, s_k \rangle$ generated by $\pi$ such that $s_k \in G$ and $k \le |S|$.

A  policy  that  is  not  proper  is  said  to  be  improper.  A  common  assumption  used  in  the  theoretical 
results  for  SSPs  is: 

Assumption  2.1.  There  exists  at  least  one  policy  that  is  both  proper  and  complete. 

By definition, every proper policy is closed and every partial policy is improper; however,
not all closed policies are proper. To illustrate this relationship between closed and proper policies,
consider the SSP depicted in Figure 2.1: $\pi_0 = \{(s_0, a_0), (s'_1, a_0)\}$ is a proper policy and
$S^{\pi_0} = \{s_0, s'_1, s_G\}$; $\pi_1 = \{(s_0, a_1), (s_1, a_1)\}$ is a partial policy because $\pi_1(s_2)$ is not defined; and
$\pi_2 = \{(s_0, a_1), (s_1, a_0)\}$ is a closed and improper policy since no goal state is reachable from $s_0$
when following $\pi_2$ and $\pi_2$ is defined for $S^{\pi_2} = \{s_0, s_1\}$.

Figure 2.1: Example of a Stochastic Shortest Path Problem (SSP). The initial state is $s_0$, the
goal set is $G = \{s_G\}$ and $C(s, a, s') = 1$, $\forall s \in S$, $a \in A$ and $s' \in S$.

Given a closed policy $\pi$, $V^\pi(s)$ is the expected accumulated cost to reach a goal state from
state $s \in S^\pi$. The function $V^\pi$, defined at least over $S^\pi$, is called the value function for $\pi$ and is
the fixed point solution for the following system of equations:

$$V^\pi(s) = \begin{cases} 0 & \text{if } s \in G \\ E[C(s, a, s') + V^\pi(s') \mid s, a = \pi(s)] & \text{otherwise} \end{cases} \quad \forall s \in S^\pi \tag{2.1}$$

where $E[C(s, a, s') + V^\pi(s') \mid s, a] = \sum_{s' \in S} P(s'|s, a) [C(s, a, s') + V^\pi(s')]$. Another common
assumption for SSPs is:

Assumption 2.2. For every closed and improper policy $\pi$, there exists at least one state $s \in S^\pi$
such that $V^\pi(s)$ is infinite.

This assumption is already true in our definition of SSPs, since the cost function $C(s, a, s')$ is
strictly positive. For instance, consider the SSP depicted in Figure 2.1; the trajectories generated
by the closed and improper policy $\pi_2 = \{(s_0, a_1), (s_1, a_0)\}$ have infinite size and, at each time
step, a strictly positive immediate cost is incurred, therefore $V^{\pi_2}(s_0) = V^{\pi_2}(s_1) = \infty$.
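One way to approximate the fixed point of Equation (2.1) on a small problem is to apply its right-hand side repeatedly. The sketch below does this for a hypothetical two-state SSP with unit costs; the function name, the example and the fixed number of sweeps are assumptions. It also shows numerically why a closed and improper policy has an infinite value: its estimate grows with every sweep.

```python
def evaluate_policy(P, C, goals, policy, states, sweeps=1000):
    """Repeatedly apply the right-hand side of Equation (2.1) over `states`."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s in goals:
                continue              # V(s) = 0 at goal states
            a = policy[s]
            V[s] = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in P[(s, a)].items())
    return V


# Hypothetical SSP: a0 reaches the goal with probability 0.5, a1 loops forever.
P = {("s0", "a0"): {"g": 0.5, "s0": 0.5}, ("s0", "a1"): {"s0": 1.0}}
C = {("s0", "a0", "g"): 1.0, ("s0", "a0", "s0"): 1.0, ("s0", "a1", "s0"): 1.0}
states, goals = ["s0", "g"], {"g"}

V_proper = evaluate_policy(P, C, goals, {"s0": "a0"}, states)
V_improper = evaluate_policy(P, C, goals, {"s0": "a1"}, states)
print(V_proper["s0"])     # approaches 2, the solution of V = 1 + 0.5 V
print(V_improper["s0"])   # equals the number of sweeps and keeps growing
```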

An optimal policy $\pi^*$ is any proper policy that minimizes, over all closed policies, the expected
cost of reaching a goal state from $s_0$, i.e., $V^{\pi^*}(s_0) \le \min_{\pi \text{ s.t. } \pi \text{ is closed}} V^\pi(s_0)$. For a given
SSP, $\pi^*$ might not be unique; however, the optimal value function $V^*$, representing for each
state $s$ the minimal expected accumulated cost to reach a goal state over all policies, exists and is
unique [Bertsekas and Tsitsiklis, 1996]. For all optimal policies $\pi^*$ and $s \in S^{\pi^*}$, we have that
$V^*(s) = V^{\pi^*}(s)$; formally, $V^*$ is the fixed point solution for the Bellman equations:

$$V^*(s) = \begin{cases} 0 & \text{if } s \in G \\ \min_{a \in A} E[C(s, a, s') + V^*(s') \mid s, a] & \text{otherwise} \end{cases} \quad \forall s \in S. \tag{2.2}$$

Every optimal policy $\pi^*$ can be obtained by replacing min by argmin in (2.2), i.e., $\pi^*$ is a greedy
policy of $V^*$:

Definition 2.2 (Greedy policy). Given a value function $V$, the greedy policy $\pi^V$ is such that
$\pi^V(s) = \operatorname{argmin}_{a \in A} E[C(s, a, s') + V(s') \mid s, a]$ for all $s \in S \setminus G$. For the states $s$ in which $V$ is
not defined, $V(s) = \infty$ is assumed.

A possible approach to computing $V^*$ is the value iteration algorithm (VI) [Howard, 1960]:
given an initial guess $V^0$ for $V^*$, compute the sequence $\langle V^0, V^1, \dots, V^k \rangle$ where $V^{t+1}$ is obtained
by performing a Bellman backup on $V^t$, that is, applying the operator $B$ to the value function $V^t$
for all $s \in S$:

$$V^{t+1}(s) = (BV^t)(s) = \begin{cases} 0 & \text{if } s \in G \\ \min_{a \in A} E[C(s, a, s') + V^t(s') \mid s, a] & \text{otherwise} \end{cases}$$

We denote by $B^k$ the composition of the operator $B$, i.e., $(B^k V)(s) = (B(B^{k-1} V))(s)$ for all
$s \in S$; thus, $V^t = B^t V^0$. Given a value function $V$, $B^t V$ represents the optimal solution for the
SSP in which the horizon is limited to $t$ and the extra cost $V(s)$ is incurred when the agent reaches
state $s \in S \setminus G$ after applying $t$ actions. $(B^t V)(s)$ is known as the $t$-look-ahead value of state $s$
according to $V$.
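A Bellman backup and its repeated application can be sketched as follows. The three-state SSP is hypothetical, the function names are introduced here for illustration, and pairs (s, a) missing from `P` are treated as inapplicable actions, which is an assumption of this sketch rather than part of the definition above.

```python
def bellman_backup(V, P, C, states, goals, actions):
    """Return BV, i.e., one synchronous Bellman backup of V over all states."""
    new_V = {}
    for s in states:
        if s in goals:
            new_V[s] = 0.0
            continue
        new_V[s] = min(
            sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in P[(s, a)].items())
            for a in actions
            if (s, a) in P
        )
    return new_V


def look_ahead(V0, t, P, C, states, goals, actions):
    """Return B^t V0, the t-look-ahead values according to V0."""
    V = dict(V0)
    for _ in range(t):
        V = bellman_backup(V, P, C, states, goals, actions)
    return V


# Hypothetical SSP: from s0, a0 moves to s1 (cost 2); a1 reaches g with probability 0.5 (cost 1).
states, goals, actions = ["s0", "s1", "g"], {"g"}, ("a0", "a1")
P = {
    ("s0", "a0"): {"s1": 1.0},
    ("s0", "a1"): {"g": 0.5, "s0": 0.5},
    ("s1", "a0"): {"g": 1.0},
}
C = {
    ("s0", "a0", "s1"): 2.0,
    ("s0", "a1", "g"): 1.0,
    ("s0", "a1", "s0"): 1.0,
    ("s1", "a0", "g"): 1.0,
}
V0 = {s: 0.0 for s in states}          # zero is an admissible initial guess here
for t in (1, 2, 3, 50):
    print(t, look_ahead(V0, t, P, C, states, goals, actions)["s0"])
```

In this example the look-ahead value of `s0` increases with $t$ towards the optimal value 2, which corresponds to always choosing `a1`.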

For SSPs in which Assumption 2.1 holds, $V^k$ converges to $V^*$ as $k \to \infty$ and $0 \le V^*(s) < \infty$
for all $s \in S$ [Bertsekas, 1995]. In practice, we are interested in the problem of finding
$\epsilon$-optimal solutions, i.e., given $\epsilon > 0$, to find a value function $V$ that is no more than $\epsilon$ away
from $V^*$:

Definition 2.3 ($\epsilon$-optimality). Given an SSP $\mathbb{S}$, a value function $V$ for $\mathbb{S}$ is $\epsilon$-optimal if

$$R(\mathbb{S}, V) = \max_{s \in S'} R(s, V) = \max_{s \in S'} |V(s) - (BV)(s)| \le \epsilon,$$

where $S' = S^{\pi^V}$, i.e., the states reachable from $s_0$ when following the greedy policy $\pi^V$; $R(s, V)$
and $R(\mathbb{S}, V)$ are known as the Bellman residual w.r.t. $V$ of the state $s$ and the SSP $\mathbb{S}$, respectively.
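The Bellman residual of Definition 2.3 is taken only over the states reachable under the greedy policy, so it can be computed while exploring those states. The sketch below does so for a hypothetical three-state SSP; the function names and the convention that pairs (s, a) missing from `P` are inapplicable are assumptions.

```python
def q_value(V, P, C, s, a):
    """E[C(s, a, s') + V(s') | s, a]."""
    return sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in P[(s, a)].items())


def greedy_action(V, P, C, s, actions):
    applicable = [a for a in actions if (s, a) in P]
    return min(applicable, key=lambda a: q_value(V, P, C, s, a))


def bellman_residual(V, P, C, goals, actions, s0):
    """Maximum of |V(s) - (BV)(s)| over the states reachable from s0 under the greedy policy."""
    seen, stack, worst = {s0}, [s0], 0.0
    while stack:
        s = stack.pop()
        if s in goals:
            worst = max(worst, abs(V[s]))      # (BV)(s) = 0 at goal states
            continue
        a = greedy_action(V, P, C, s, actions)
        worst = max(worst, abs(V[s] - q_value(V, P, C, s, a)))
        for s2, p in P[(s, a)].items():
            if p > 0 and s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return worst


# Hypothetical SSP with optimal values V*(s0) = 2, V*(s1) = 1, V*(g) = 0.
goals, actions = {"g"}, ("a0", "a1")
P = {
    ("s0", "a0"): {"s1": 1.0},
    ("s0", "a1"): {"g": 0.5, "s0": 0.5},
    ("s1", "a0"): {"g": 1.0},
}
C = {
    ("s0", "a0", "s1"): 2.0,
    ("s0", "a1", "g"): 1.0,
    ("s0", "a1", "s0"): 1.0,
    ("s1", "a0", "g"): 1.0,
}
zero = {"s0": 0.0, "s1": 0.0, "g": 0.0}
optimal = {"s0": 2.0, "s1": 1.0, "g": 0.0}
print(bellman_residual(zero, P, C, goals, actions, "s0"))
print(bellman_residual(optimal, P, C, goals, actions, "s0"))
```

Note that `s1` is never visited in either call, because the greedy policy of both value functions chooses `a1` in `s0`; its value therefore does not affect the residual of the SSP.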

Any initial guess $V^0$ for $V^*$ can be used in VI and if $V^0$ is a lower bound of $V^*$, i.e.,
$V^0(s) \le V^*(s)$ for all $s \in S$, then $V^0$ is referred to as an admissible heuristic. For any two value
functions $V$ and $V'$, we write $V \le V'$ if $V(s) \le V'(s)$ for all $s \in S$; thus, $V^0$ is an admissible
heuristic if $V^0 \le V^*$. Another important definition regarding value functions is monotonicity.

Definition 2.4 (Monotonic Value Function). A value function $V$ is monotonic if $V \le BV$.

The following well-known result is necessary in most of our proofs in this dissertation:

Theorem 2.1. Given an SSP $\mathbb{S}$ in which Assumption 2.1 holds, the operator $B$ preserves
[Bertsekas and Tsitsiklis, 1996, Lemma 2.1]:

• admissibility: if $V \le V^*$, then $B^k V \le V^*$ for $k \in \mathbb{N}^*$; and

• monotonicity: if $V \le BV$, then $V \le B^k V$ for $k \in \mathbb{N}^*$.


2.2  Factored  Representation 

In the previous section, we reviewed SSPs using their enumerative representation (also known
as explicit representation). In the enumerative representation, the set of states $S$, the set of goal
states $G$, the set of actions $A$ and the transition probability distributions $P(\cdot|\cdot, \cdot)$ are represented
explicitly by directly enumerating each element of them. This enumerative specification of $\mathbb{S}$
can be burdensome for large problems, especially the encoding of $P(\cdot|\cdot, a)$ as an $|S| \times |S|$ matrix
for each action $a$. Also, in many cases it is advantageous, from both the computational and
representational perspective, to define a set of states by their properties; for instance, the goal for
a service robot navigating in a building could be compactly represented by a high-level statement
such as "the robot is at a kitchen".

To compactly represent large SSPs and to use high-level statements to represent sets of states,
the factored representation is used [Boutilier et al., 1999]. In the factored representation, SSPs
are encoded using state variables, i.e., variables f_i with domain D_i, and the set of state variables
is denoted as F = {f_1, ..., f_k}. The cross product D_1 × ... × D_k represents the state space S,
thus a state s ∈ S is the tuple (v_1, ..., v_|F|) where v_i ∈ D_i. For example, the SSP in Figure 2.2
can be factored using two binary state variables, x and y, such that state (x, y) equals
the state s_i for i = x + 2y. For the rest of this dissertation, we assume the domain of each state
variable f ∈ F to be binary, thus |S| = 2^|F|.
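The mapping between factored and enumerated states in the example can be written out directly. This is a minimal sketch; the names `state_index`, `factored_states` and `index_of` are illustrative choices, not notation from the text.

```python
from itertools import product

F = ["x", "y"]  # the two binary state variables of the example


def state_index(x, y):
    """Index i of the enumerated state s_i that corresponds to the factored state (x, y)."""
    return x + 2 * y


# Cross product of the binary domains; each tuple is ordered as in F, i.e., (x, y).
factored_states = list(product([0, 1], repeat=len(F)))
index_of = {xy: state_index(*xy) for xy in factored_states}
```

With two binary variables the cross product has 2^|F| = 4 elements, one per enumerated state s_0, ..., s_3.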

Figure 2.2: Example of a factored SSP. The initial state is s_0, the goal set is G = {s_3} and
C(s, a, s') = 1 for all s ∈ S, a ∈ A, s' ∈ S. This SSP can be represented as a factored SSP with
two binary state variables, x and y, such that the state (x, y) equals the state s_i for i = x + 2y.

(:action a0
  :effect (and (y) (prob 0.25 (x) 0.75 (not (x))))
)

(:action a1
  :precondition (not (x))
  :effect (x)
)

Figure 2.3: Example of PPDDL representation of the actions of the SSP in Figure 2.2. Note that
only action a1 has a precondition.

Another benefit of using state variables is a compact representation of the transition probabilities
P(·|·, a) using two-stage temporal Bayesian networks [Boutilier et al., 1999]. To illustrate
the space savings obtained by using the factored representation of actions, consider action a_0 of
the SSP depicted in Figure 2.2. The enumerative representation of P(·|·, a_0) is a 4-by-4 stochastic
matrix, which is encoded with 4 × 3 = 12 numbers. For this example, a factored representation
is P((x', y')|(x, y), a_0) = P(x'|x, a_0) × P(y'|y, a_0) where P(x' = 1|x = 0, a_0) = 0.25,
P(x' = 1|x = 1, a_0) = 1 and P(y' = 1|y = 0, a_0) = P(y' = 1|y = 1, a_0) = 1, which can be
encoded with only 4 numbers.
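As a sanity check on the factorization above, the four numbers can be expanded back into the full 4-by-4 matrix. The sketch assumes the state indexing i = x + 2y from the text; the dictionary and function names are hypothetical.

```python
# The four numbers quoted in the text for action a0.
P_X1 = {0: 0.25, 1: 1.0}   # P(x' = 1 | x, a0)
P_Y1 = {0: 1.0, 1: 1.0}    # P(y' = 1 | y, a0)


def bernoulli(p_one, value):
    """Probability that a binary variable takes `value` when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def transition_matrix():
    """Expand the factored model into the enumerative matrix T[i][j] = P(s_j | s_i, a0)."""
    T = [[0.0] * 4 for _ in range(4)]
    for x in (0, 1):
        for y in (0, 1):
            for x2 in (0, 1):
                for y2 in (0, 1):
                    i, j = x + 2 * y, x2 + 2 * y2
                    T[i][j] = bernoulli(P_X1[x], x2) * bernoulli(P_Y1[y], y2)
    return T


T = transition_matrix()
```

Each row of `T` is a probability distribution, so only three of its four entries are free, which is where the count of 4 × 3 = 12 numbers for the enumerative encoding comes from.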

The Probabilistic Planning Domain Description Language (PPDDL) [Younes and Littman,
2004] is a standard language to represent factored SSPs that is used in the international
probabilistic planning competitions (IPPC) [Younes et al., 2005, Bonet and Givan, 2007, Bryce and
Buffet, 2008]. PPDDL syntax is based on LISP, and an action a consists of a precondition, that
is, a formula over the state variables characterizing the states in which a is applicable, and an
effect. The effect describes how the state variables change when a is applied. Any state variable
not explicitly modified by a remains unchanged after executing a (frame assumption). Figure 2.3
contains the PPDDL representation of the actions a_0 and a_1 of the SSP represented in Figure 2.2.

PPDDL also features predicates and action schemas. These extensions use the concept of domain
variables, i.e., classes of finite objects. A predicate is a mapping from a value assignment of one
or more domain variables to a state variable. For instance, we can model a graph G = (N, E) by
using a domain variable called node whose domain is N and the edges of the graph as the predicate
edge(i, j) where i and j are domain variables of the type node; in this case, each possible
instantiation of edge(i, j) represents one binary state variable. Therefore, if the planning problem
defines three objects of the type node, namely n_1, n_2, n_3, then six state variables are instantiated
representing the edges (n_1, n_2), (n_1, n_3), (n_2, n_1), ..., (n_3, n_2). Similarly to predicates, action
schemas map a value assignment of one or more domain variables to an action.
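The grounding of the edge predicate can be reproduced in a few lines. This sketch assumes, as the count of six in the text implies, that i and j must be distinct objects; the string format of the variable names is an arbitrary choice.

```python
from itertools import permutations

nodes = ["n1", "n2", "n3"]  # the three objects of type node

# One binary state variable per ordered pair of distinct nodes.
edge_variables = [f"edge({i},{j})" for i, j in permutations(nodes, 2)]
```

If self-loops edge(i, i) were also allowed, the same predicate would ground to nine state variables instead of six.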

2.3  Relevant  Probabilistic  Planning  Algorithms 

In this section, we review two algorithms for solving SSPs: Real-Time Dynamic Programming
(Section 2.3.1), which uses dynamic programming and sampling in order to compute optimal closed
policies; and FF-Replan (Section 2.3.2), which relaxes SSPs into deterministic problems and returns
partial policies.

2.3.1  Real-Time  Dynamic  Programming 

Real-Time Dynamic Programming (RTDP) [Barto et al., 1995] is an extension of Learning-Real-Time-A*
[Korf, 1990] for probabilistic planning problems. RTDP computes closed policies
instead of complete policies and, since a closed policy π is defined only for the states in S^π ⊆ S,
RTDP converges to the ε-optimal solution faster than VI when |S^π| ≪ |S|.

RTDP, presented in Algorithm 2.1, simulates the current greedy policy π^V (Line 13) to sample
trajectories from the initial state s_0 to a goal state. Each trajectory is sampled by the procedure
RTDP-Trial (Line 7): while the current state s is not a goal state, the greedy action a w.r.t. V
is chosen; a Bellman backup is applied on s; and a resulting state of applying a on s is sampled
(Lines 10 to 13). The value function V is initialized by the input heuristic H (Line 3) using a lazy
approach, i.e., if the value V(s) is requested and V is not defined on s, then H(s) is computed
(on demand) and assigned to V(s).

Since the greedy selection of actions is interleaved with updates on V, RTDP-Trial cannot
be trapped in loops and always reaches a goal state. Formally, if Assumption 2.1 holds for the
given SSP, then RTDP-Trial always terminates. Moreover, if Assumption 2.1 holds and the
heuristic H used is also admissible, then RTDP always converges to the optimal solution V*, i.e.,
R(𝕊, V) = 0, after several calls of RTDP-Trial (possibly infinitely many) [Barto et al., 1995,
Theorem 3, p. 132].
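The trial loop described above can be sketched in Python. Several simplifications are assumptions of this example rather than part of the algorithm as published: the test problem is the domino line of Section 3.1 with a state abstracted to the number of standing dominoes, the heuristic is zero, and a fixed number of trials replaces the residual test R(𝕊, V) > ε.

```python
import random

N = 3  # number of dominoes; a state is how many are standing (an abstraction)


def outcomes(s, a):
    """Return (cost, [(next_state, probability), ...]) for action a in state s."""
    if a == "delegate":
        return 9.0, [(N, 1.0)]
    return 1.0, [(s + 1, 0.1), (0, 0.9)]   # place: succeed, or drop the whole line


def applicable(s):
    return ["place", "delegate"] if s == 0 else ["place"]


def q_value(V, s, a):
    cost, succ = outcomes(s, a)
    return sum(p * (cost + V[s2]) for s2, p in succ)


def greedy(V, s):
    return min(applicable(s), key=lambda a: q_value(V, s, a))


def rtdp_trial(V, rng, s=0):
    """Follow the greedy policy from s to the goal, backing up each visited state."""
    while s != N:
        a = greedy(V, s)
        V[s] = q_value(V, s, a)            # Bellman backup: a attains the minimum
        _, succ = outcomes(s, a)
        states, probs = zip(*succ)
        s = rng.choices(states, weights=probs)[0]
    return V


def rtdp(trials=2000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(N + 1)}     # zero heuristic
    for _ in range(trials):
        rtdp_trial(V, rng)
    return V


V = rtdp()
```

On this problem the value of the empty line rises with every failed placement until the backup prefers delegate, after which V(0) stays at the cost of delegate.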

2.3.2  FF-Replan 

FF-Replan [Yoon et al., 2007] is a replanner based on determinization, i.e., a relaxation of a given
SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩ into a deterministic problem D = ⟨S, s_0, G, A'⟩. The set A' contains
only deterministic actions represented as a = s → s', i.e., a deterministically transforms s into s'.
Two common determinization techniques are:
 1  RTDP(SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩, H a heuristic for V*, ε > 0)
 2  begin
 3      V ← Value function for S with default value given by H
 4      while R(𝕊, V) > ε do
 5          V ← RTDP-Trial(𝕊, V)
 6      return V

 7  RTDP-Trial(SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩, value function V)
 8  begin
 9      s ← s_0
10      while s ∉ G do
11          a ← π^V(s)
12          V(s) ← (BV)(s)
13          s ← Apply-Action(a, s)
14      return V

Algorithm 2.1: Real-Time Dynamic Programming (RTDP) [Barto et al., 1995].


• most-likely outcome, in which A' = {s → s' | ∃a ∈ A s.t. s' = argmax_ŝ P(ŝ|s, a)} (breaking
ties randomly); and

• all-outcomes, where A' = {s → s' | ∃a ∈ A s.t. P(s'|s, a) > 0}.
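Both determinizations listed above are short set comprehensions once the transition model is a dictionary. The sketch below uses the domino line of Section 3.1, abstracted to the number of standing dominoes, as test data; this encoding and the function names are assumptions for illustration. Ties are broken by dictionary order here rather than randomly.

```python
# Domino line with 3 pieces; P maps (state, action) to {next_state: probability}.
P = {
    (0, "place"): {1: 0.1, 0: 0.9},
    (1, "place"): {2: 0.1, 0: 0.9},
    (2, "place"): {3: 0.1, 0: 0.9},
    (0, "delegate"): {3: 1.0},
}


def most_likely(P):
    """Keep, for each (s, a), only the transition s -> s' with the highest probability."""
    return {(s, max(dist, key=dist.get)) for (s, _a), dist in P.items()}


def all_outcomes(P):
    """Keep every transition s -> s' that has non-zero probability under some action."""
    return {(s, s2) for (s, _a), dist in P.items() for s2, p in dist.items() if p > 0}
```

In this example the most-likely determinization loses every successful placement, so the goal is reachable only through delegate, while the all-outcomes determinization keeps the place chain 0 → 1 → 2 → 3.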

The idea behind FF-Replan (Algorithm 2.2) is simple and powerful: relax the probabilistic
problem into a deterministic problem D (Line 7) and use the deterministic planner FF [Hoffmann
and Nebel, 2001] to solve D (Line 8). FF-Replan stores the obtained solution for D in the
policy π (Line 10), and there is no guarantee that π is a closed policy for the original SSP 𝕊, that
is, π might be a partial policy for 𝕊. The policy π is followed until failure (Line 11), i.e., until π is
not defined for the current state s; if and when it fails, FF is re-invoked to plan again from the failed
state.

An earlier version of FF-Replan employed the most-likely outcome determinization [Yoon
et al., 2007]; however, this approach is not complete since the goal might not be reachable in the
most-likely determinization of 𝕊 even when Assumption 2.1 holds for 𝕊. Alternatively, if the
all-outcomes determinization is used, then FF-Replan is complete when Assumption 2.1 holds, i.e.,
FF-Replan always reaches a goal state. In this dissertation, we consider the most recent version
of FF-Replan, i.e., the version in which the all-outcomes determinization is used.
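The replanning loop can be sketched as follows. This is not the FF planner: a uniform-cost search stands in for it, and the test problem is the domino line of Section 3.1 abstracted to the number of standing dominoes. Both choices, and all function names, are assumptions made for the example.

```python
import heapq
import random

N = 3
COST = {"place": 1.0, "delegate": 9.0}
P = {
    (0, "place"): {1: 0.1, 0: 0.9},
    (1, "place"): {2: 0.1, 0: 0.9},
    (2, "place"): {3: 0.1, 0: 0.9},
    (0, "delegate"): {3: 1.0},
}


def all_outcomes(P):
    """Deterministic actions as (s, a, s_next) triples, one per possible outcome."""
    return [(s, a, s2) for (s, a), dist in P.items() for s2, p in dist.items() if p > 0]


def plan(det, start, goal):
    """Uniform-cost search standing in for FF; returns {state: action} on the cheapest path."""
    best, parent, frontier = {start: 0.0}, {}, [(0.0, start)]
    while frontier:
        g, s = heapq.heappop(frontier)
        if s == goal:
            break
        if g > best[s]:
            continue
        for u, a, v in det:
            if u == s and g + COST[a] < best.get(v, float("inf")):
                best[v] = g + COST[a]
                parent[v] = (s, a)
                heapq.heappush(frontier, (best[v], v))
    path, s = {}, goal
    while s != start:
        s, a = parent[s]
        path[s] = a
    return path


def ff_replan(rng):
    det = all_outcomes(P)
    policy, s, spent = {}, 0, 0.0
    while s != N:
        if s not in policy:                  # failure: the policy is undefined here
            policy.update(plan(det, s, N))
        a = policy[s]
        spent += COST[a]
        dist = P[(s, a)]
        s = rng.choices(list(dist), weights=list(dist.values()))[0]
    return policy, spent


policy, spent = ff_replan(random.Random(0))
```

On this problem the relaxed plan costs 3, so the stored policy applies place everywhere and never triggers a replan, even though each failed placement returns the agent to the empty line.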

FF-Replan was the winner of the first International Probabilistic Planning Competition (IPPC)
[Younes et al., 2005], in which it outperformed the probabilistic planners due to their poor
scalability. In general, FF-Replan can quickly reach a goal state and scales up to large problems when
 1  FF-Replan(SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩)
 2  begin
 3      π ← empty-policy
 4      s ← s_0
 5      while s ∉ G do
 6          if π is not defined for s then
 7              D ← All-Outcomes-Determinization(𝕊)
 8              ⟨s_1, a_1, s_2, ..., a_{k−1}, s_k⟩ ← FF(D, s)
 9              foreach i ∈ {1, ..., k − 1} do
10                  π(s_i) ← a_i
11          s ← Apply-Action(π(s), s)

Algorithm 2.2: FF-Replan [Yoon et al., 2007]. On Line 8, the deterministic planner FF [Hoffmann
and Nebel, 2001] is called to compute a sequence of states and actions starting from
the current state s that reaches a goal state in G. Different determinization approaches and
deterministic planners can be used in Lines 7 and 8, respectively.


Assumption 2.1 holds. Despite its major success, FF-Replan is non-optimal and oblivious to
probabilities and dead ends, leading to high-cost solutions and poor performance in probabilistically
interesting problems [Little and Thiebaux, 2007], e.g., the triangle tire-domain.


2.4  Summary 

In this chapter, we described Stochastic Shortest Path Problems (SSPs), the framework used in
this dissertation to represent probabilistic planning problems. We also presented the main
definitions and results regarding the solutions of SSPs that are necessary for our proofs in this
dissertation. We described how to compactly represent SSPs through the factored representation
and the PPDDL language, a standard language from the planning community to represent
probabilistic planning problems. Finally, we reviewed two main algorithms to solve SSPs: RTDP, an
optimal probabilistic planner that returns closed policies; and FF-Replan, a replanner based on
determinizations. In the next chapter, we examine how to combine the main features of both RTDP
and FF-Replan, i.e., optimality and scalability.
Chapter 3

Short-Sighted  Probabilistic  Planning 


In this chapter, we present the two main concepts for short-sighted probabilistic planning, i.e.,
how to generate short-sighted problems and how to plan using short-sighted problems [Trevizan
and Veloso, 2012a, Trevizan and Veloso, 2013]. We begin by comparing RTDP and FF-Replan to
motivate the definition of short-sighted problems. We then formally define the concept of
short-sighted problems in Section 3.2 and prove its properties with respect to the original probabilistic
planning problem in Section 3.2.1. In Section 3.3, we present the Short-Sighted Probabilistic
Planner algorithm that solves probabilistic planning problems using the short-sighted problems
defined previously. The properties of this algorithm, e.g., optimality, are proven in Section 3.3.1.
We empirically demonstrate the benefits of Short-Sighted Probabilistic Planner against RTDP
and FF-Replan in Section 3.4 using a proposed series of increasingly larger problems.


3.1  Motivation 

In order to motivate the introduction of short-sighted problems, consider the problem of building
a domino line. Precisely, given 3 dominoes, the goal is to build a straight line using all 3
dominoes. A domino can be placed at position ℓ ∈ {0, 1, 2} of the line if ℓ is empty and, with
probability 0.9, the new domino falls and drops all the other dominoes already in the line. The
cost of action place is 1 independently of its outcome. Also, the special action delegate, which
delegates the construction of the domino line to a more reliable agent, is available when the
line is empty. Delegate deterministically builds the complete line of 3 dominoes at a cost of 9.
Figure 3.1 depicts this problem.

Due to the all-outcomes determinization (Section 2.3.2), FF-Replan considers that it is possible
to deterministically place each one of the dominoes and that this relaxed action costs 1, since
the original action place has cost 1. Therefore, FF-Replan solves this problem by building the
Figure 3.1: Domino line problem for n = 3. The initial state is the empty line and the goal is
to build a line of 3 dominoes. The full-line arrows depict the action place, which succeeds with
probability 0.1; otherwise (probability 0.9), all dominoes in the line are dropped (for ease of
presentation, this side-effect is depicted as a ball-ended line). The action delegate is shown as
a dashed arrow.

domino  line  piece-by-piece  at  a  total  relaxed  cost  of  3.  However,  in  the  original  problem,  the 
expected  cost  of  this  solution  is  1110  (the  formal  analysis  of  the  expected  cost  is  provided  in 
Section  3.4). 

Alternatively, RTDP samples several trajectories from the initial state (empty dominoes line)
to the goal (3-dominoes line). Initially, these sampled trajectories contain only the action place,
since it costs 1 and delegate costs 9. After a large number of samples, RTDP learns that building
the dominoes line piece-by-piece is more expensive in expectation than using action delegate,
i.e., expected cost of 1110 versus constant cost of 9, and selects delegate, the optimal solution
for this problem.

Notice that RTDP is forced to explore the whole state space, i.e., all the combinations of 1, 2
and 3 domino placements, before inferring that delegate is the optimal action. However, the
expected cost of successfully placing the first domino is already larger than the cost of delegate.
Precisely, the expected cost c of successfully placing the first domino satisfies c = 1 + 0.9c, i.e.,
c = 10. Therefore, if we divide this problem of building a 3-dominoes line into 3 subproblems, namely,
building a line of 1 domino, then a line of 2, and finally a line of 3 dominoes, we would be able
to infer that delegate is the optimal solution after solving only the first subproblem, i.e., building a
line of 1 domino.
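Both expected costs quoted in this section, 10 for the first domino and 1110 for the full line, follow from the same recurrence E_k = 1 + 0.1 E_{k+1} + 0.9 E_0 with E_n = 0, where E_k is the expected cost to finish from k standing dominoes using only place. The sketch below solves it exactly; the function name and the substitution E_k = a_k + b_k E_0 are choices made for this illustration.

```python
from fractions import Fraction


def expected_cost_place_only(n, p_success=Fraction(1, 10)):
    """Expected number of place actions to stand n dominoes when a failure resets the line."""
    # Write E_k = a_k + b_k * E_0 and propagate backwards from E_n = 0 (a_n = b_n = 0):
    #   a_k = 1 + p * a_{k+1},   b_k = p * b_{k+1} + (1 - p).
    a, b = Fraction(0), Fraction(0)
    for _ in range(n):
        a, b = 1 + p_success * a, p_success * b + (1 - p_success)
    # At k = 0 the substitution reads E_0 = a_0 + b_0 * E_0.
    return a / (1 - b)
```

Exact rational arithmetic avoids the rounding that floating point would introduce when dividing by 1 − b_0 = 0.001.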

In the remainder of this chapter, we introduce short-sighted problems, a novel definition of
subproblems of probabilistic planning problems in which actions are not simplified; therefore,
the expected cost of each subproblem can be computed. We then show how to use short-sighted
problems in order to efficiently solve probabilistic planning problems. We also revisit the domino
example in Section 3.4 to show the trade-offs of short-sighted planning.


3.2  Short-Sighted  Stochastic  Shortest  Path  Problems 

In  this  section,  we  define  depth-based  short-sighted  Stochastic  Shortest  Path  Problems,  a  special 
case  of  Stochastic  Shortest  Path  Problems  (SSPs)  in  which  the  original  problem  is  transformed 
into  a  smaller  one  by: 

•  pruning  the  states  that  have  a  zero  probability  of  being  reached  using  at  most  t  actions; 

•  adding  artificial  goal  states;  and 

•  incrementing  the  cost  of  reaching  artificial  goals  by  a  heuristic  value  in  order  to  guide  the 
search  towards  the  goals  of  the  original  problem. 

Throughout this chapter, we refer to depth-based short-sighted Stochastic Shortest Path Problems
as short-sighted SSPs. Before formally introducing them, we need to define the action-distance
between states:

Definition 3.1 (δ(s, s')). The non-symmetric distance δ(s, s') between two states s and s' is:

\[
\delta(s, s') =
\begin{cases}
0 & \text{if } s = s' \\
1 + \min\limits_{a \in A} \; \min\limits_{\hat{s} \,:\, P(\hat{s} \mid s, a) > 0} \delta(\hat{s}, s') & \text{otherwise}
\end{cases}
\]

δ(s, s') is equivalent to the minimum number of actions necessary to reach s' from s in the
all-outcomes determinization.
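Because δ is a shortest-path length in the all-outcomes determinization, it can be computed by breadth-first search. The sketch below uses the domino line of Section 3.1, abstracted to the number of standing dominoes, as test data; that encoding and the name `action_distance` are assumptions for illustration.

```python
from collections import deque

# Domino line with 3 pieces; P maps (state, action) to {next_state: probability}.
P = {
    (0, "place"): {1: 0.1, 0: 0.9},
    (1, "place"): {2: 0.1, 0: 0.9},
    (2, "place"): {3: 0.1, 0: 0.9},
    (0, "delegate"): {3: 1.0},
}


def action_distance(P, s):
    """Return {s': δ(s, s')} for every state s' reachable from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for (state, _action), outcomes in P.items():
            if state != u:
                continue
            for v, p in outcomes.items():
                if p > 0 and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return dist
```

The example also shows why the distance is called non-symmetric: two placements are needed to go from the empty line to two dominoes, but a single failed placement goes back.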


Using the action-distance function δ (Definition 3.1), the short-sighted SSP associated to an
SSP is defined as:

Definition 3.2 (Short-Sighted SSP). Given an SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩, a state s ∈ S, t ∈ N*
and a heuristic H, the (s, t)-short-sighted SSP 𝕊_{s,t} = ⟨S_{s,t}, s, G_{s,t}, A, P, C_{s,t}⟩ associated with 𝕊 is
defined as:

• S_{s,t} = {s' ∈ S | δ(s, s') ≤ t};

• G_{s,t} = {s' ∈ S | δ(s, s') = t} ∪ (G ∩ S_{s,t});

• the cost function C_{s,t}, given by

\[
C_{s,t}(s', a, s'') =
\begin{cases}
C(s', a, s'') + H(s'') & \text{if } s'' \in G_{s,t} \setminus G \\
C(s', a, s'') & \text{otherwise}
\end{cases}
\qquad \forall s' \in S_{s,t},\ s'' \in S_{s,t},\ a \in A.
\]

For simplicity, when the heuristic H is not clear by context nor explicit, then H(s) = 0 for all
s ∈ S.
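The three components of Definition 3.2 can be built directly from the action-distance. This sketch again uses the domino line of Section 3.1 abstracted to the number of standing dominoes; the encoding, the function names and the constant heuristic used in the usage check are assumptions for illustration.

```python
from collections import deque

N = 3
GOALS = {N}
COST = {"place": 1.0, "delegate": 9.0}
P = {
    (0, "place"): {1: 0.1, 0: 0.9},
    (1, "place"): {2: 0.1, 0: 0.9},
    (2, "place"): {3: 0.1, 0: 0.9},
    (0, "delegate"): {3: 1.0},
}


def action_distance(s):
    """Breadth-first computation of δ(s, ·) over transitions with non-zero probability."""
    dist, queue = {s: 0}, deque([s])
    while queue:
        u = queue.popleft()
        for (state, _a), outcomes in P.items():
            if state == u:
                for v, p in outcomes.items():
                    if p > 0 and v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
    return dist


def short_sighted_ssp(s, t, H=lambda state: 0.0):
    """Return (S_{s,t}, G_{s,t}, C_{s,t}) for the (s, t)-short-sighted SSP."""
    delta = action_distance(s)
    S_st = {u for u, d in delta.items() if d <= t}
    G_st = {u for u in S_st if delta[u] == t} | (GOALS & S_st)
    artificial = G_st - GOALS                 # artificial goals pay the heuristic on entry

    def C_st(u, a, v):
        return COST[a] + (H(v) if v in artificial else 0.0)

    return S_st, G_st, C_st
```

For s equal to the empty line, t = 1 keeps the empty line, the one-domino state and the goal reached by delegate, while t = 2 additionally keeps the two-domino state, which becomes the only artificial goal.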
Figure 3.2: Example of (s, t)-depth-based short-sighted SSPs for the 3-line dominoes problem
(Figure 3.1). For both (a) and (b), the parameter s equals the initial state (no dominoes) and t
equals 1 and 2 for (a) and (b), respectively. The action delegate is shown as a dashed arrow
and ball-ended lines represent the side-effect of place in which all domino pieces are dropped,
i.e., a transition to the initial state.


Figure 3.2 shows the (s_0, 1)- and (s_0, 2)-short-sighted SSPs associated with the 3-dominoes
line example (Figure 3.1), where s_0 represents the initial state (i.e., no dominoes). The state
space S_{s,t} of (s, t)-short-sighted SSPs is a subset of the original state space in which any state
s' ∈ S_{s,t} is reachable from s using at most t actions. Given a short-sighted SSP 𝕊_{s,t}, we refer
to the states s' ∈ G_{s,t} \ G as artificial goals and we denote the set of artificial goals by G_a, thus
G_a = G_{s,t} \ G.

The key feature of short-sighted SSPs that allows them to be used for solving SSPs is given
by the definition of C_{s,t}: every artificial goal state s_a ∈ G_a has its heuristic value H(s_a) added
to the cost of reaching s_a. Therefore, the search for a solution to short-sighted SSPs is guided
towards the goal states of the original SSP, even if such states are not in S_{s,t}.


3.2.1  Properties 

Since short-sighted SSPs are also SSPs, the optimal value function for 𝕊_{s,t}, denoted as V*_{𝕊_{s,t}},
is defined by (2.2). Although related, V*_{𝕊_{s,t}}(s) and (B^t H)(s), i.e., the t-look-ahead value
of s w.r.t. H (Section 2.1, p. 12), are not the same. Before we formally prove their differences,
consider the 3-dominoes line problem depicted in Figure 3.1, depth t = 2, and the zero-heuristic
as H:
Figure 3.3: Example of look-ahead search tree for the 3-line dominoes problem (Figure 3.1). In
this example, the root of the search tree is the initial state s_0 and the depth is t = 2. Ball-ended
lines represent a transition to the state in which the line is empty in the next level of the tree; this
transition happens with probability 0.9.


• The 2-look-ahead search from s_0, (B^2 H)(s_0), represents the minimum expected cost of
executing 2 actions in a row, therefore only trajectories of size 2 are considered. Figure 3.3
shows the search tree associated with (B^2 H)(s_0). The resulting value is (B^2 H)(s_0) = 2,
which is obtained by applying any sequence of two place actions, since delegate has cost
9.

• The optimal value function for 𝕊_{s_0,2} on s_0, V*_{𝕊_{s_0,2}}(s_0), is defined as the minimum expected
cost to reach a goal state in 𝕊_{s_0,2} (Figure 3.2(b)), i.e., a state in G_{s_0,2}, from s_0. Thus all
possible trajectories in 𝕊_{s_0,2} are considered and the maximum size of these trajectories is
unbounded due to the loops generated by the policy in which the action place is applied. In
this example, V*_{𝕊_{s_0,2}}(s_0) = 9 and the closed greedy policy w.r.t. V*_{𝕊_{s_0,2}} is to apply delegate
in the initial state.

Precisely,  the  difference  between  the  look-ahead  and  short-sighted  SSPs  is  in  how  the  original 
SSP  is  relaxed:  look-ahead  changes  the  indefinite  horizon  of  the  original  SSP  to  a  finite  horizon; 
and  short-sighted  SSPs  prune  the  state  space  of  the  original  SSP  without  changing  the  horizon. 
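The two values in the example, 2 for the look-ahead and 9 for the short-sighted SSP, can be reproduced numerically. The sketch abstracts a state to the number of standing dominoes and hard-codes the goal set {2, 3} of the (s_0, 2)-short-sighted SSP under the zero heuristic; these simplifications and all names are assumptions made for illustration.

```python
N = 3
COST = {"place": 1.0, "delegate": 9.0}


def successors(s, a):
    return [(N, 1.0)] if a == "delegate" else [(s + 1, 0.1), (0, 0.9)]


def applicable(s):
    return ["place", "delegate"] if s == 0 else ["place"]


def backup(V, s, terminal):
    """One Bellman backup of state s; terminal states have value 0."""
    if s in terminal:
        return 0.0
    return min(
        sum(p * (COST[a] + V[s2]) for s2, p in successors(s, a))
        for a in applicable(s)
    )


def lookahead(t):
    """(B^t H) for the zero heuristic H on the original problem: finite horizon t."""
    V = {s: 0.0 for s in range(N + 1)}
    for _ in range(t):
        V = {s: backup(V, s, {N}) for s in V}
    return V


def short_sighted_value(goals, sweeps=200):
    """Value iteration to convergence on the pruned problem: horizon left unbounded."""
    V = {s: 0.0 for s in range(N + 1)}
    for _ in range(sweeps):
        V = {s: backup(V, s, goals) for s in V}
    return V
```

The look-ahead stops after two backups and so never sees the cost of repeated failures, whereas value iteration on the pruned problem keeps iterating until the place loop becomes more expensive than delegate.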

In order to formally prove the relation between V*_{𝕊_{s,t}}(s) and (B^t H)(s), we introduce B_{s,t},
the Bellman operator B applied to the short-sighted SSP 𝕊_{s,t}. To simplify our proofs, we define
(B_{s,t} V)(s) to be equal to 0 if s ∈ G_{s,t} and

\[
(B_{s,t} V)(s) = \min_{a \in A} \left[ \sum_{s' \in S_{s,t} \setminus G_a} P(s' \mid s, a) \left[ C_{s,t}(s, a, s') + V(s') \right] + \sum_{s' \in G_a} P(s' \mid s, a)\, C_{s,t}(s, a, s') \right]
\]

for all s ∈ S_{s,t} \ G_{s,t}. The only difference between the definitions of B and B_{s,t} is the explicit
treatment of the states s_a ∈ G_a in the summation by B_{s,t}: V(s_a) is not considered since s_a is
an artificial goal of 𝕊_{s,t}. If V(s_a) = 0 for all s_a ∈ G_a, then BV = B_{s,t}V for 𝕊_{s,t}. Lemmas 3.1
and 3.2 relate the operator B applied to an SSP 𝕊 with the operator B_{s,t} applied to the
(s, t)-short-sighted SSP 𝕊_{s,t} associated with 𝕊.

Lemma 3.1. Given an SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩ that satisfies Assumption 2.1, s ∈ S,
t ∈ N* and a monotonic value function V for 𝕊, then (B^k_{s,t} V)(s) = (B^k V)(s) for all s ∈ S_{s,t} \ G_a
s.t. min_{s_a ∈ G_a} δ(s, s_a) ≥ k, where B and B_{s,t} represent, respectively, the Bellman operator
applied to 𝕊 and 𝕊_{s,t}.

Proof. See Appendix. □

Lemma 3.2. Under the same conditions of Lemma 3.1, (B^k_{s,t} V)(s) ≤ (B^k V)(s) for all k ∈ N*
and s ∈ S_{s,t}, where B and B_{s,t} represent, respectively, the Bellman operator applied to 𝕊 and
𝕊_{s,t}.

Proof. See Appendix. □

In Theorem 3.3, we prove that V*_{𝕊_{s,t}}(s) ≤ V*(s) and that V*_{𝕊_{s,t}}(s) is a lower bound for V*(s)
at least as tight as (B^t H)(s) if H is a monotonic lower bound on V* and Assumption 2.1 holds
for 𝕊. Corollary 3.4 shows that V*_{𝕊_{s,t}}(s) is always a tighter lower bound than (B^t H)(s) if 𝕊_{s,t} has
unavoidable loops (Definition 3.3).

Theorem 3.3. Given an SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩ that satisfies Assumption 2.1, s ∈ S,
t ∈ N* and a monotonic lower bound H for V*, then

(B^t H)(s) ≤ V*_{𝕊_{s,t}}(s) ≤ V*(s).

Proof. By the definition of 𝕊_{s,t}, min_{s_a ∈ G_a} δ(s, s_a) = t. Therefore (B^t H)(s) = (B^t_{s,t} H)(s) by
Lemma 3.1. Since H is a monotonic lower bound and V*_{𝕊_{s,t}}(s) = (lim_{k→∞} B^k_{s,t} H)(s), we have
that (B^t H)(s) ≤ V*_{𝕊_{s,t}}(s). By Lemma 3.2, we have that V*_{𝕊_{s,t}}(s) ≤ V*(s).

□

Definition 3.3 (Unavoidable Loops). An SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩ that satisfies
Assumption 2.1 has unavoidable loops if, for every optimal policy π* of 𝕊, the directed graph G = (S^{π*}, E),
where E = {(s, s') | P(s'|s, π*(s)) > 0}, is not acyclic.

Corollary 3.4. In Theorem 3.3, if the (s, t)-short-sighted SSP 𝕊_{s,t} has unavoidable loops
(Definition 3.3), then (B^t H)(s) < V*_{𝕊_{s,t}}(s).

Proof. By definition, (B^t H)(s) considers only trajectories of size at most t from s. By definition,
V*_{𝕊_{s,t}}(s) = lim_{k→∞} (B^k_{s,t} H)(s), thus all possible trajectories on 𝕊_{s,t} are considered by V*_{𝕊_{s,t}}. By
assumption, 𝕊_{s,t} has unavoidable loops, therefore the maximum size of a trajectory generated
by π*_{𝕊_{s,t}} is unbounded. Since every trajectory has non-zero probability and non-zero cost, then
(B^t H)(s) = (B^t_{s,t} H)(s) < V*_{𝕊_{s,t}}(s). □

Another important relation between SSPs and short-sighted SSPs is through their policies.
To formalize this relationship, we first define the concept of t-closed policy w.r.t. s, i.e., policies
that can be executed from s, independently of the probabilistic outcome of actions, for at least
t actions without replanning:

Definition 3.4 (t-closed policy). A policy π for an SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩ is t-closed w.r.t. a
state s ∈ S if, for all s' ∈ R^π ∩ S^π, δ(s, s') ≥ t.

FF-Replan and its extensions (see Chapter 6) compute 1-closed policies w.r.t. the current
state, i.e., there is no guarantee that the partial policy computed by them can be executed for more
than one action without replanning. Notice that, when t → ∞, t-closed policies w.r.t. s_0 are
equivalent to closed policies. Proposition 3.5 gives an upper bound on t for when a t-closed
policy w.r.t. s_0 becomes a closed policy.

Proposition 3.5. Given an SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩, for t ≥ |S|, every t-closed policy w.r.t. s_0
for 𝕊 is also a closed policy for 𝕊.

Proof. Since π is t-closed w.r.t. s_0 for t ≥ |S|, then, for all s' ∈ R^π ∩ S^π, δ(s_0, s') ≥ |S|. By
the definition of S^π, we have that every s' ∈ S^π is reachable from s_0 when following π. Thus
δ(s_0, s') < |S|, since there exists a trajectory from s_0 to s' that visits each state at most once, i.e.,
that uses at most |S| − 1 actions. Therefore R^π ∩ S^π = ∅, i.e., π is a closed policy, since there
exists no s' ∈ S^π such that δ(s_0, s') ≥ |S|. □

Policies for SSPs and policies for their associated (s, t)-short-sighted SSPs are related through
the concept of t-closed policies w.r.t. s:

Proposition 3.6. Given an SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩ and a state s ∈ S, π is a closed policy for
𝕊_{s,t} if and only if π is a t-closed policy w.r.t. s for 𝕊.

Proof. We assume that π is a closed policy for 𝕊_{s,t}, i.e., R^π_{s,t} ∩ S^π_{s,t} = ∅. For contradiction
purposes, suppose that there exists s' ∈ R^π ∩ S^π such that δ(s, s') < t. Since δ(s, s') < t, then
s' ∈ S_{s,t}; thus s' ∈ S^π_{s,t} ⊆ S^π and s' ∈ R^π_{s,t} ⊆ R^π. This is a contradiction because R^π_{s,t} ∩ S^π_{s,t} = ∅,
therefore, for all s' ∈ R^π ∩ S^π, δ(s, s') ≥ t, i.e., π is t-closed w.r.t. s for 𝕊.

Now, we assume that π is t-closed w.r.t. s for 𝕊, i.e., for all s' ∈ R^π ∩ S^π, δ(s, s') ≥ t. By the
definition of 𝕊_{s,t}, we have that, for all s' ∈ S_{s,t}, δ(s, s') ≤ t. Thus, if s' ∈ (R^π ∩ S^π) ∩ S_{s,t}, then
δ(s, s') = t, i.e., s' ∈ G_{s,t} \ G. Since, by the definition of R^π, R^π_{s,t} ∩ G_{s,t} = ∅ and S^π_{s,t} = S^π ∩ S_{s,t},
then R^π_{s,t} ∩ S^π_{s,t} = ∅, i.e., π is a closed policy for 𝕊_{s,t}.
 1  Non-Learning-Planner(SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩, t ∈ N*, H a heuristic for V*)
 2  begin
 3      s ← s_0
 4      while s ∉ G do
 5          𝕊_{s,t} ← Generate-Short-Sighted-SSP(𝕊, s, H, t)
 6          π_{𝕊_{s,t}} ← SSP-Solver(𝕊_{s,t})
 7          while s ∉ G_{s,t} do
 8              s ← Apply-Action(π_{𝕊_{s,t}}(s), s)

Algorithm 3.1: Non-learning algorithm to solve SSPs using short-sighted SSPs. Any
probabilistic planner can be used as SSP-Solver, e.g., value iteration, FF-Replan, and RTDP.


□ 


3.3  Short-Sighted  Probabilistic  Planner 

We present a step towards the definition of our main probabilistic planner by describing its basic
non-learning version. This algorithm, Non-Learning-Planner (Algorithm 3.1), is the
straightforward adaptation of Proposition 3.6: a short-sighted SSP is generated, solved and its
solution is applied in the original SSP (Lines 5 to 8); if an artificial goal is reached, then the
procedure is repeated.

Non-Learning-Planner makes no assumption about the algorithm used as SSP-Solver
(Line 6), and its behavior is highly dependent on the algorithm chosen to solve each short-sighted
SSP. For instance, consider the 3-dominoes line problem (Figure 3.1), FF-Replan as
SSP-Solver and H_0 as heuristic; then, independently of the value of t and the current state
s, the solution returned by FF-Replan to 𝕊_{s,t} is always a sequence of place actions. Therefore,
Non-Learning-Planner using FF-Replan is unable to find the optimal solution for the
3-dominoes line problem.

In order to illustrate the need for an algorithm that learns, i.e., improves the given heuristic
as execution (or simulation) is performed, consider the 3-dominoes line problem with the cost of
delegate changed from 9 to 11. If Non-Learning-Planner, using RTDP as SSP-Solver,
t = 1 and H_0 as heuristic, is applied to this modification of the 3-dominoes line problem, then
after 𝕊_{s_0,1} (Figure 3.2(a)) is solved, we have that V*_{𝕊_{s_0,1}}(s_0) = 10 and a place action is chosen.
Every time that the initial state s_0 is revisited, a high probability event since every place action
results in s_0 with probability 0.9, the same bound is computed by RTDP, i.e., RTDP is reinvoked
to solve 𝕊_{s_0,1} generated using H_0 as heuristic. Therefore, Non-Learning-Planner is unable
to infer that delegate is the optimal solution and always chooses a place action on s_0.
 1  SSiPP(SSP 𝕊 = ⟨S, s_0, G, A, P, C⟩, t ∈ N*, H a heuristic for V*, ε > 0)
 2  begin
 3      V ← Value function for S with default value given by H
 4      s ← s_0
 5      while s ∉ G do
 6          𝕊_{s,t} ← Generate-Short-Sighted-SSP(𝕊, s, V, t)
 7          (π*_{𝕊_{s,t}}, V*_{𝕊_{s,t}}) ← ε-Optimal-SSP-Solver(𝕊_{s,t}, V, ε)
 8          foreach s' ∈ S^{π*_{𝕊_{s,t}}} \ G_{s,t} do
 9              V(s') ← V*_{𝕊_{s,t}}(s')
10          while s ∉ G_{s,t} do
11              s ← Apply-Action(π*_{𝕊_{s,t}}(s), s)
12      return V

Algorithm 3.2: Short-Sighted Probabilistic Planner (SSiPP). Any ε-optimal SSP solver can be
used as ε-Optimal-SSP-Solver, e.g., value iteration and RTDP. Notice that V*_{𝕊_{s,t}} returned
by ε-Optimal-SSP-Solver needs to be defined only for the states reachable from s when
following π*_{𝕊_{s,t}}, i.e., for s' ∈ S^{π*_{𝕊_{s,t}}}.


Short-Sighted Probabilistic Planner (SSiPP), presented in Algorithm 3.2, overcomes the drawbacks of Non-Learning-Planner by maintaining a lower bound $V$ for $V^*$ that is updated according to the optimal solution of the generated short-sighted SSPs (Lines 8 and 9).¹ The lower bound $V$ is initialized by the input heuristic $H$ (Line 3) using a lazy approach, i.e., if the value $V(s)$ is requested and $V$ is not defined on $s$, then $H(s)$ is computed (on demand) and assigned to $V(s)$.

Due to the reduced state space of short-sighted SSPs, it is possible to compute the $\epsilon$-optimal solution of each $\mathbb{S}_{s,t}$ efficiently (Line 7), and the obtained policy $\pi^*_{\mathbb{S}_{s,t}}$ is a $t$-closed policy w.r.t. the current state $s$ for the original SSP $\mathbb{S}$ (Proposition 3.6). Therefore, $\pi^*_{\mathbb{S}_{s,t}}$ can be simulated or directly executed in the environment (Line 11) for at least $t$ steps before replanning is needed, i.e., before another short-sighted SSP is generated and solved.
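
The loop of Algorithm 3.2 can be sketched in Python as follows. This is only one possible reading of the algorithm for depth-based short-sighted SSPs, not the implementation used in the experiments. It assumes the n-dominoes line problem of Example 3.1 as test domain (with n = 3, p = 0.1 and k = 11 standing in for the modified 3-dominoes line problem), plain value iteration as the $\epsilon$-optimal solver and $H_0$ as heuristic. As a simplification, $V$ is updated on every non-goal state of the short-sighted SSP rather than only on the states reachable under the computed policy. All function names (`dominoes`, `short_sighted`, `solve`, `ssipp`) are hypothetical.

```python
import random
from collections import deque


def dominoes(n, p, k):
    """n-dominoes line SSP; a state is the frozenset of filled positions."""
    goal = frozenset(range(n))

    def actions(s):
        acts = [("place", i) for i in range(n) if i not in s]
        return acts + ([("delegate", None)] if not s else [])

    def outcomes(s, a):
        # List of (probability, cost, next_state).
        if a[0] == "delegate":
            return [(1.0, k, goal)]
        return [(p, 1.0, s | {a[1]}), (1.0 - p, 1.0, frozenset())]

    return frozenset(), goal, actions, outcomes


def short_sighted(s, t, goal, actions, outcomes):
    """States at most t actions away from s, plus real and artificial goals."""
    depth = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == goal or depth[u] == t:
            continue
        for a in actions(u):
            for _, _, v in outcomes(u, a):
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
    goals = {u for u, d in depth.items() if u == goal or d == t}
    return set(depth), goals


def solve(states, goals, V, actions, outcomes, eps):
    """Value iteration; goals keep the value given by V (H_0 if undefined)."""
    W = {u: V.get(u, 0.0) for u in states}
    inner = [u for u in states if u not in goals]

    def q(u, a):
        return sum(pr * (c + W[v]) for pr, c, v in outcomes(u, a))

    while True:
        residual = 0.0
        for u in inner:
            new = min(q(u, a) for a in actions(u))
            residual = max(residual, abs(new - W[u]))
            W[u] = new
        if residual <= eps:
            break
    policy = {u: min(actions(u), key=lambda a: q(u, a)) for u in inner}
    return policy, W


def ssipp(n, p, k, t, eps=1e-6, rng=None):
    rng = rng or random.Random(0)
    s, goal, actions, outcomes = dominoes(n, p, k)
    V, total_cost = {}, 0.0
    while s != goal:
        states, goals = short_sighted(s, t, goal, actions, outcomes)
        policy, W = solve(states, goals, V, actions, outcomes, eps)
        for u in policy:               # simplified version of Lines 8 and 9
            V[u] = W[u]
        while s not in goals:          # execute until a (possibly artificial) goal
            outs = outcomes(s, policy[s])
            _, c, s = rng.choices(outs, weights=[o[0] for o in outs])[0]
            total_cost += c
    return V, total_cost


V, cost = ssipp(n=3, p=0.1, k=11.0, t=1)
print("lower bound at s0:", V[frozenset()], "accumulated cost:", cost)
```

Under these assumptions the bound stored for the empty line never exceeds the cost of delegate, since delegate is always available there; the accumulated cost depends on the sampled action outcomes.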

To illustrate the execution of SSiPP, let us revisit the modified 3-dominoes line problem in which delegate has cost 11. For this example, consider SSiPP using RTDP as $\epsilon$-Optimal-SSP-Solver, $t = 1$ and $H_0$ as heuristic, i.e., initially $V(s) = 0$ for all $s \in S$ (Line 3). The first short-sighted SSP generated and solved is $\mathbb{S}_{s_0,1}$ (Figure 3.2(a)) and, after Line 8 is executed, we have that $V(s_0) = 10$. Since delegate costs 11, a place action is chosen and applied until the current state $s$ changes from $s_0$ to a state $s' \neq s_0$. Denote this chosen action as $a$. Once $s'$ is reached, $\mathbb{S}_{s',1}$ is generated and solved, and $V(s')$ is updated to a value greater than zero since $s'$ is not a goal state of the original problem. When the state $s_0$ is revisited for the first time, the expected cost of applying $a$ in $\mathbb{S}_{s_0,1}$ using $V$ as heuristic equals $0.9(1 + V(s_0)) + 0.1(1 + V(s')) = 10 + 0.1V(s') > 10$ since $V(s') > 0$. Therefore, action $a$ is not chosen since the expected cost of applying any of the remaining two place actions in $s_0$ is 10. As we prove in the next section, this process continues and the optimal solution is found.

¹ SSiPP is pronounced as the word “sip.”

1  Run-SSiPP-Until-Convergence(SSP $\mathbb{S} = (S, s_0, G, A, P, C)$, $t \in \mathbb{N}^*$, $H$ a heuristic for $V^*$, $\epsilon > 0$)
2  begin
3      $V \leftarrow$ Value function for $S$ with default value given by $H$
4      while $R(\mathbb{S}, V) > \epsilon$ do
5          $V \leftarrow$ SSiPP($\mathbb{S}$, $t$, $V$, $\epsilon$)
6      return $V$

Algorithm 3.3: Algorithm to compute an $\epsilon$-approximation of $V^*$ using SSiPP (Algorithm 3.2).


3.3.1  Guarantees 

In this section, we prove that: SSiPP performs Bellman backups (Theorem 3.7); SSiPP terminates (Theorem 3.8); and Algorithm 3.3 is asymptotically optimal (Theorem 3.9), that is, if the same problem is solved sufficiently many times by SSiPP, then the optimal policy is found.
Theorem 3.7. Given an SSP $\mathbb{S} = (S, s_0, G, A, P, C)$ such that Assumption 2.1 holds, and a monotonic lower bound $H$ for $V^*$, the loop in Line 8 of SSiPP (Algorithm 3.2) is equivalent to applying at least one Bellman backup on $V$ for every state $s' \in S^{\pi^*_{\mathbb{S}_{s,t}}} \setminus G_{s,t}$.

Proof. Let $\hat{S}$ denote $S^{\pi^*_{\mathbb{S}_{s,t}}} \setminus G_{s,t}$. After the loop in Line 8 is executed, we have that, for all $s' \in \hat{S}$, $V(s')$ equals $V^*_{\mathbb{S}_{s,t}}(s')$. Thus, we need to prove that $(BV)(s') \le V^*_{\mathbb{S}_{s,t}}(s')$ $\forall s' \in \hat{S}$, since $V$ is monotonic and admissible (Theorem 2.1). By the definition of short-sighted SSP (Definition 3.2), every state $s' \in \hat{S}$ is such that $\{s'' \in S \mid P(s''|s', a) > 0, \forall a \in A\} \subseteq S_{s,t}$, i.e., the states reached after applying an action in a state $s' \in \hat{S}$ belong to $S_{s,t}$. Therefore, $(BV)(s') = (B_{s,t}V)(s')$ $\forall s' \in \hat{S}$, where $B_{s,t}$ is the Bellman operator $B$ for $\mathbb{S}_{s,t}$. Since $V$ is monotonic and admissible, $(B_{s,t}V)(s') \le V^*_{\mathbb{S}_{s,t}}(s')$. Therefore, $(BV)(s') \le V^*_{\mathbb{S}_{s,t}}(s')$ $\forall s' \in \hat{S}$. □

Theorem  3.8.  SSiPP  always  terminates  under  the  same  conditions  of  Theorem  3.7. 

Proof. Suppose SSiPP does not terminate. Then, there exists a trajectory $T$ of infinite size that can be generated by SSiPP. Since $S$ is finite, there must be an infinite loop in $T$ and, for all states $s$ in this loop, $V(s)$ diverges as the execution continues. Because Assumption 2.1 holds for $\mathbb{S}$, we have that $V^*(s) < \infty$ for all $s \in S$. This is a contradiction, since SSiPP maintains $V$, initialized as $H$, admissible and monotonic (Theorems 3.3 and 3.7), i.e., $V(s) \le V^*(s)$ for all $s \in S$. □

Theorem 3.9. Given an SSP $\mathbb{S} = (S, s_0, G, A, P, C)$ such that Assumption 2.1 holds, a monotonic lower bound $H$ for $V^*$, and $t \in \mathbb{N}^*$, the sequence $(V^0, V^1, \dots, V^k)$, where $V^0 = H$ and $V^i = \text{SSiPP}(\mathbb{S}, t, V^{i-1})$, converges to $V^*$ as $k \to \infty$ for all $s \in S^{\pi^*}$.

Proof. Let the sequence of states $\mathcal{H} = (s_0, s_1, s_2, \dots)$ be the concatenation of the trajectories $T_i$ of states visited by SSiPP when $V^i$ is computed. By Theorem 3.8, $T_i$ has finite size, therefore $|\mathcal{H}|$ is finite. Since Assumption 2.1 holds for $\mathbb{S}$ and $H$ is admissible and monotonic, when $k \to \infty$, we can construct an SSP $\mathbb{S}_\infty = (S_\infty, s_0, G_\infty, A_\infty, P, C)$ such that [Barto et al., 1995, Theorem 3, p. 132]: $S_\infty \subseteq S$ is the non-empty set of states that appear infinitely often in $\mathcal{H}$; $G_\infty \subseteq G$ is the non-empty set of goal states that appear infinitely often in $\mathcal{H}$; and $A_\infty \subseteq A$ is the set of actions $a$ such that $P(s'|s, a) = 0$ for all $s \in S_\infty$ and $s' \in S \setminus S_\infty$. Therefore, there is a finite time step $T$ such that the sequence $\mathcal{H}_T$ of states visited after time step $T$ contains only states in $S_\infty$. By Theorem 3.7, we know that at least one Bellman backup is applied to $s_j$ for any time step $j$. Thus, after time step $T$, the sequence of Bellman backups applied by SSiPP is equivalent to asynchronous value iteration on $\mathbb{S}_\infty$ and $V^k(s)$ converges to $V^*(s)$ for all $s \in S_\infty$ as $k \to \infty$ [Bertsekas and Tsitsiklis, 1996, Proposition 2.2, p. 27]. Furthermore, $S^{\pi^*} \subseteq S_\infty$ [Barto et al., 1995, Theorem 3]. □

3.4  The  n-Dominoes  Line  Problem 

In this section, we generalize the 3-dominoes line problem to any number of dominoes (Example 3.1). The obtained series of problems, the n-dominoes line problems, has characteristics that illustrate the benefits of short-sighted planning as the parameters of the problem vary [Veloso and Blythe, 1994]. Precisely, we illustrate the trade-offs of short-sighted planning by analyzing how the cost of delegate and the failure probability of the place actions influence the solutions for the n-dominoes line problems. Then we present an empirical comparison between RTDP, FF-Replan and SSiPP in the n-dominoes line problems for different parameters.

Example 3.1 (n-dominoes line). Informally, given $n$ dominoes, the goal of this problem is to build a line using all the dominoes. The actions place(i), for $i \in \{0, \dots, n-1\}$, represent placing a domino in position $i$ of the line being built. Every action place(i) has cost 1 and can fail with probability $1 - p$, in which case all the dominoes already placed are dropped and the line needs to be rebuilt from scratch. If the domino line is empty, the action delegate can be applied. Delegate costs $k$ and deterministically builds the n-dominoes line.

(a)
action: $a_i$
pre: $\neg l_i$
with probability $p$:
    add: $l_i$
    del: $\emptyset$
with probability $1 - p$:
    add: $\emptyset$
    del: $l_0, \dots, l_{n-1}$
cost: 1

(b)
action: $d$
pre: $\neg l_0, \dots, \neg l_{n-1}$
add: $l_0, \dots, l_{n-1}$
del: $\emptyset$
cost: $k$

Figure 3.4: Definition of the actions in the n-dominoes line problems. Actions $a_i$ (place(i)) and $d$ (delegate) are presented using probabilistic STRIPS in (a) and (b), respectively.


Formally, we represent the domino line using the binary state variables $l_0, \dots, l_{n-1}$, where $l_i$ is true if there is a domino at position $i$ of the line. We denote the actions place(i) by $a_i$ and delegate by $d$. Figure 3.4 shows the formal definition of $a_i$ and $d$. In the initial state $s_0$, all the state variables are false, and the goal set $G$ equals $\{s_G\}$, where $s_G$ is the state in which all state variables are true.

The n-dominoes line problem has $n! + 1$ closed policies: $\pi_d$, which selects action $d$ on $s_0$; and the $n!$ policies representing the permutations of $\pi_a = (a_0, a_1, \dots, a_{n-1})$, where $a_{i+1}$ is applied when $a_i$ succeeds, i.e., results in a state $s \neq s_0$. Notice that every permutation of $\pi_a$ results in the same overall policy, in which the dominoes line is built one piece at a time. Since every action $a_i$ has the same probability $p$ of succeeding (in which case $l_i$ is changed from false to true), the same probability of returning to the initial state and the same cost, every permutation of $\pi_a$ has the same expected cost $V^{\pi_a}(s_0)$ to reach the goal from the initial state.

In order to compute $V^{\pi_a}(s_0)$, consider the recurrence $T(i)$, which represents the expected cost of solving the problem of size $n$ by using $\pi_a$ when there are only $i$ dominoes missing in the line. Clearly, $T(0) = 0$ since the dominoes line is done when no domino is missing in the line. Moreover, we have $V^{\pi_a}(s_0) = T(n)$, because the domino line is empty in the initial state. Since the domino line is constructed by adding one domino at each time step, for $i \in \{1, \dots, n\}$ we have that $T(i) = 1 + (1 - p)T(n) + pT(i - 1)$. Let $c_{x:y}$ denote $\sum_{j=x}^{y} p^j$. Unrolling $T(i)$, we get $T(i) = c_{0:i-1} + c_{0:i-1}(1 - p)T(n)$ for $i \in \{1, \dots, n\}$, therefore

$$V^{\pi_a}(s_0) = T(n) = c_{0:n-1} + c_{0:n-1}(1 - p)V^{\pi_a}(s_0) = \frac{c_{0:n-1}}{1 - (1 - p)c_{0:n-1}} = \sum_{j=0}^{n-1} p^{j-n}.$$

Since $d$ is deterministic and has cost $k$, we have that $V^{\pi_d}(s_0) = k$, thus $\pi_d$ is the only optimal policy for the n-dominoes line problem when $k < V^{\pi_a}(s_0) = \sum_{j=0}^{n-1} p^{j-n}$.
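
The recurrence for $T(i)$ and the closed form $\sum_{j=0}^{n-1} p^{j-n}$ can be cross-checked numerically. The Python sketch below is an illustration only, and its function names are hypothetical. It solves the recurrence by writing each $T(i)$ as an affine function $a_i + b_i x$ of the unknown $x = T(n)$ and compares the result with the closed form.

```python
def expected_cost_closed_form(n: int, p: float) -> float:
    """Closed form for the expected cost of pi_a: sum_{j=0}^{n-1} p^(j-n)."""
    return sum(p ** (j - n) for j in range(n))


def expected_cost_recurrence(n: int, p: float) -> float:
    """Solve T(i) = 1 + (1-p) T(n) + p T(i-1), with T(0) = 0, for T(n).

    Each T(i) is kept as a_i + b_i * x, where x = T(n) is the unknown.
    """
    a, b = 0.0, 0.0                      # T(0) = 0
    for _ in range(n):
        a, b = 1.0 + p * a, (1.0 - p) + p * b
    return a / (1.0 - b)                 # x = a_n + b_n * x


for n, p in [(1, 0.1), (3, 0.1), (10, 0.5)]:
    print(n, p, expected_cost_closed_form(n, p), expected_cost_recurrence(n, p))
```

For instance, with $n = 10$ and $p = 0.5$ both computations give 2046, so delegate is the optimal choice for any $k < 2046$ in the 10-dominoes line problem.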

Figure 3.5: Average and 95% confidence interval for the number of actions to reach the goal of the 10-dominoes line problem. For this experiment, 100 samples were used and the parameter $p$ of the dominoes line problem equals 0.5. Given a value of $i$ in the x-axis, the cost of delegate equals $\sum_{j=0}^{i-1} 0.5^{j-i} - 1$, i.e., the expected cost of building a line of $i$ dominoes decreased by 1. FF-Replan performance is constant because it always applies the same policy regardless of the cost of delegate. Legend: SSiPP t=1, SSiPP t=2, SSiPP t=3, SSiPP t=4, SSiPP t=5, RTDP Trial, FF-Replan.


To demonstrate the trade-offs of SSiPP, consider the expected cost $V^{\pi_a}_{\mathbb{S}_{s_0,t}}(s_0)$ of $\pi_a$ applied in the first short-sighted SSP solved by SSiPP, i.e., $\mathbb{S}_{s_0,t}$ using $H_0$ as heuristic. If $t \ge n$, then $\mathbb{S}_{s_0,t}$ equals the original problem, because all the states can be reached using at most $n$ actions; thus $V^{\pi_a}_{\mathbb{S}_{s_0,t}}(s_0) = V^{\pi_a}(s_0)$. For $t < n$, every artificial goal of $\mathbb{S}_{s_0,t}$ represents a line of $t$ dominoes and $V^{\pi_a}_{\mathbb{S}_{s_0,t}}(s_0) = T(t) = \sum_{j=0}^{t-1} p^{j-t}$ (the recurrence above for a line of size $t$) since $H_0(s) = 0$ for all $s \in G_a$. Therefore, if $k < V^{\pi_a}_{\mathbb{S}_{s_0,t}}(s_0) = \sum_{j=0}^{t-1} p^{j-t}$, then SSiPP using the parameter $t$ always selects $\pi_d$, which is also the optimal solution for the original problem; moreover, at most $|S_{s_0,t}| = 1 + \sum_{i=0}^{t} \binom{n}{i}$ states are visited, i.e., all the states necessary to build a line of $t$ dominoes using $n$ dominoes plus the original goal state, while the original problem has $2^n$ states.
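
As a quick numerical illustration of this state count, the following Python sketch (the function name is hypothetical) evaluates $1 + \sum_{i=0}^{t} \binom{n}{i}$ for the 10-dominoes line problem and prints it next to the $2^n$ states of the original problem.

```python
from math import comb


def short_sighted_state_count(n: int, t: int) -> int:
    """Bound on |S_{s0,t}| for t < n: lines with at most t dominoes, plus the goal."""
    return 1 + sum(comb(n, i) for i in range(t + 1))


n = 10
for t in (1, 2, 3):
    print(f"t={t}: {short_sighted_state_count(n, t)} states versus {2 ** n}")
```

For small values of $t$ the short-sighted state space is a small fraction of the original one, which is what makes each call to the $\epsilon$-optimal solver cheap.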

When $V^{\pi_a}_{\mathbb{S}_{s_0,t}}(s_0) < k < V^{\pi_a}(s_0)$, SSiPP can still infer that $\pi_d$ is the optimal solution efficiently. We illustrate this case by empirically comparing RTDP-Trial (Algorithm 2.1 Line 7), FF-Replan and SSiPP for different values of $t$. Figure 3.5 shows the number of actions to reach the goal in the 10-dominoes line problem for $p = 0.5$ as a function of $k$. Each value of $k$ considered equals $V^{\pi_a}_{\mathbb{S}_{s_0,i}}(s_0) - 1$, where $i$ is the x-axis of Figure 3.5. Therefore, for $i \in \{1, \dots, 10\}$, $\pi_d$ is the optimal solution and SSiPP using $t \ge i$ always chooses $\pi_d$, i.e., solves the problem using only one action.

3.5  Summary

In this chapter, we described short-sighted Stochastic Shortest Path Problems (short-sighted SSPs), the main concept for short-sighted probabilistic planning. We then formally proved the properties of the solutions of short-sighted SSPs, in particular that, under common assumptions, the optimal solution of a short-sighted SSP is a lower bound on the optimal solution of the original problem. Moreover, we showed that a closed policy for an $(s, t)$-short-sighted SSP can be executed for at least $t$ steps from $s$ in the original SSP without replanning.

We also introduced Short-Sighted Probabilistic Planner (SSiPP), an algorithm that solves probabilistic planning problems by iteratively solving short-sighted SSPs and using their optimal solutions to update the lower bound on the optimal solution of the original SSP. We formally proved that, under common assumptions, SSiPP always reaches a goal state of the original problem and, if the same problem is solved sufficiently many times by SSiPP, then the optimal policy is found. Using the n-dominoes line problems introduced in this chapter, we illustrated how SSiPP is able to efficiently compute the solution of probabilistic planning problems.

In the next chapter, we extend the concept of short-sighted SSP by changing how states are pruned, e.g., we use the probability of reaching a state instead of the action-distance function $\delta$. Chapter 5 presents extensions of SSiPP to incorporate other techniques from the probabilistic planning community, e.g., labeling of converged states and determinizations.


Chapter  4 


General  Short-Sighted  Models 


In this chapter, we extend the definition of depth-based short-sighted SSPs. We begin by introducing trajectory-based short-sighted SSPs, in which states that have low probability of being reached are pruned from the state space [Trevizan and Veloso, 2012b]. Next, in Section 4.2, we present greedy short-sighted SSPs, which use as state space only the best $k$ states according to the current bound on $V^*$ and their probability of being reached. In Section 4.3, we prove a set of sufficient conditions under which SSiPP always terminates and is asymptotically optimal [Trevizan and Veloso, 2012b].


4.1  Trajectory-Based  Short-Sighted  SSPs 

To motivate the definition of trajectory-based short-sighted SSPs, consider the SSP shown in Figure 4.1. In this example, there are two closed policies, $\pi_0 = \{(s_0, a_0), (s'_1, a_0), (s'_2, a_0), (s'_3, a_0)\}$ and $\pi_1 = \{(s_0, a_1), (s_1, a_1), (s_2, a_1), (s_3, a_1)\}$, representing the bottom and top chains, respectively. The optimal policy $\pi^*$ is $\pi_0$ because both $a_0$ and $a_1$ have the same cost independently of their outcomes, the length of both chains is the same, and $a_0$ has a lower self-loop probability, i.e., $P(s'_i|s'_i, a_0) < P(s_i|s_i, a_1)$.

Figure 4.2 depicts the $(s_0, t)$-depth-based short-sighted SSPs (Definition 3.2) associated with the example in Figure 4.1. For this example, the state space $S_{s_0,t}$ of the $(s_0, t)$-depth-based short-sighted SSP contains $t$ states of both chains because depth-based short-sighted SSPs ignore probabilities for the generation of $S_{s_0,t}$. In the next section, we introduce trajectory-based short-sighted SSPs, a new class of short-sighted SSPs that prunes states based on their probability of being reached as opposed to their distance.

Figure 4.1: Example of SSP to motivate the definition of trajectory-based short-sighted SSPs. The initial state is $s_0$, the goal set is $G = \{s_G\}$, and $C(s, a, s') = 1$, $\forall s \in S$, $a \in A$ and $s' \in S$.

Figure 4.2: Examples of $(s_0, t)$-depth-based short-sighted SSPs for the SSP in Figure 4.1: (a) $t = 1$; (b) $t = 2$; (c) $t = 3$. For $t \ge 4$, the $(s_0, t)$-depth-based short-sighted SSP equals the original SSP.


4.1.1  Definition 

Trajectory-based short-sighted SSPs (Definition 4.2) address the issue of states with low probability of being reached by explicitly defining their state space $S_{s,p}$ based on the maximum probability $P_{\max}(s, s')$ of a trajectory starting at $s$ and stopping at $s'$:

Definition 4.1 ($P_{\max}(s, s')$). The maximum trajectory probability between two states $s$ and $s'$ is:

$$P_{\max}(s, s') = \begin{cases} 1 & \text{if } s = s' \\ 0 & \text{if } s \neq s' \text{ and } s \in G \\ \max_{\hat{s} \in S,\, a \in A} P(\hat{s}|s, a)\, P_{\max}(\hat{s}, s') & \text{otherwise} \end{cases}$$

Definition 4.2 (Trajectory-Based Short-Sighted SSP). Given an SSP $\mathbb{S} = (S, s_0, G, A, P, C)$, a state $s \in S$, $p \in [0, 1]$ and a heuristic $H$, the $(s, p)$-trajectory-based short-sighted SSP $\mathbb{S}_{s,p} = (S_{s,p}, s, G_{s,p}, A, P, C_{s,p})$ associated with $\mathbb{S}$ is defined as:

•  $S_{s,p} = \{s' \in S \mid \exists \hat{s} \in S \text{ and } a \in A \text{ s.t. } P_{\max}(s, \hat{s}) \ge p \text{ and } P(s'|\hat{s}, a) > 0\}$;

•  $G_{s,p} = (G \cap S_{s,p}) \cup (S_{s,p} \cap \{s' \in S \mid P_{\max}(s, s') < p\})$;

•  $C_{s,p}(s', a, s'') = \begin{cases} C(s', a, s'') + H(s'') & \text{if } s'' \in G_{s,p} \\ C(s', a, s'') & \text{otherwise} \end{cases}$, $\forall s' \in S_{s,p}$, $a \in A$, $s'' \in S_{s,p}$.

For simplicity, when $H$ is neither clear from context nor explicit, then $H(s) = 0$ for all $s \in S$.

Figure 4.3: Examples of $(s_0, p)$-trajectory-based short-sighted SSPs for the SSP in Figure 4.1: (a) $p = 1.0$; (b) $p \in [0.75, 1.0)$; (c) $p \in [0.75^2, 0.75)$; (d) $p \in [0.75^3, 0.75^2)$.

Figure 4.3 shows, for values of $p \in [0.75^3, 1]$, the trajectory-based $\mathbb{S}_{s_0,p}$ for the SSP in Figure 4.1. For instance, if $p = 0.75^3$ (Figure 4.3(d)), then $S_{s_0,0.75^3} = \{s_0, s_1, s'_1, s'_2, s'_3, s_G\}$ and $G_{s_0,0.75^3} = \{s_1, s_G\}$. The case $p = 0.75^3$ shows how trajectory-based short-sighted SSPs can manage uncertainty efficiently: $|S_{s_0,p}| = 6$ and the goal $s_G$ of the original SSP is already included in $S_{s_0,p}$ while, for the depth-based short-sighted SSPs, $s_G \in S_{s_0,t}$ only for $t \ge 4$, in which case $|S_{s_0,t}| = |S| = 8$.

Notice that the definition of $S_{s,p}$ cannot be simplified to $\{\hat{s} \in S \mid P_{\max}(s, \hat{s}) \ge p\}$ since not all the resulting states of actions would be included in $S_{s,p}$. For example, consider the SSP in Figure 4.4(a); the set of states $S' = \{\hat{s} \in S \mid P_{\max}(s_0, \hat{s}) \ge p\} = \{s_0, s_H\}$ for all $p \in (0.1, 0.9]$. Therefore, if we use $S'$ to generate a short-sighted SSP, an invalid SSP would be obtained (Figure 4.4(c)) because action $a$ is included in the model and $s_L$, an effect of $a$ with non-zero probability, is not in the state space $S'$.
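
Definitions 4.1 and 4.2 can be sketched in Python as follows. The sketch is illustrative only: the encoding of the SSP as nested dictionaries and the function names are hypothetical, and the transition probabilities used for the SSP of Figure 4.1 (forward probability 0.75 on the bottom chain and 0.4 on the top chain) are assumptions inferred from the surrounding text. $P_{\max}(s, \cdot)$ is computed with a max-product variant of Dijkstra's algorithm, which is valid because all factors are at most 1.

```python
import heapq


def p_max_from(s, P, goals):
    """P_max(s, .) of Definition 4.1; P[u][a][v] is the probability P(v | u, a)."""
    best = {s: 1.0}
    heap = [(-1.0, s)]
    closed = set()
    while heap:
        neg, u = heapq.heappop(heap)
        if u in closed:
            continue
        closed.add(u)
        if u in goals:                 # trajectories do not continue through goals
            continue
        for outcomes in P[u].values():
            for v, prob in outcomes.items():
                cand = -neg * prob
                if cand > best.get(v, 0.0):
                    best[v] = cand
                    heapq.heappush(heap, (-cand, v))
    return best


def trajectory_based(s, p, P, goals):
    """State space S_{s,p} and goal set G_{s,p} of Definition 4.2."""
    pmax = p_max_from(s, P, goals)
    S = {v
         for u, pu in pmax.items() if pu >= p
         for outcomes in P[u].values()
         for v, prob in outcomes.items() if prob > 0}
    G = {v for v in S if v in goals or pmax[v] < p}
    return S, G


# Assumed encoding of Figure 4.1: a0 follows the bottom (primed) chain,
# a1 follows the top chain; the remaining probability mass is a self-loop.
P = {
    "s0":  {"a0": {"s1'": 0.75, "s0": 0.25}, "a1": {"s1": 0.4, "s0": 0.6}},
    "s1":  {"a1": {"s2": 0.4, "s1": 0.6}},
    "s2":  {"a1": {"s3": 0.4, "s2": 0.6}},
    "s3":  {"a1": {"sG": 0.4, "s3": 0.6}},
    "s1'": {"a0": {"s2'": 0.75, "s1'": 0.25}},
    "s2'": {"a0": {"s3'": 0.75, "s2'": 0.25}},
    "s3'": {"a0": {"sG": 0.75, "s3'": 0.25}},
    "sG":  {},
}

S, G = trajectory_based("s0", 0.75 ** 3, P, {"sG"})
print(sorted(S), sorted(G))
```

Under these assumed probabilities, the call with $p = 0.75^3$ yields the six states and the two goals discussed for Figure 4.3(d). Collecting the successors of every state whose $P_{\max}$ is at least $p$, rather than those states alone, is what keeps the resulting SSP well defined.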

4.1.2  Triangle  Tire  World 

In this section, we use the triangle tire world [Little and Thiebaux, 2007] series of problems to show the advantage of trajectory-based short-sighted SSPs. In the triangle tire world problems, a car has to travel between locations in order to reach a goal location from its initial location. Every time the car moves between locations, a flat tire happens with probability 0.5. The car carries only one spare tire, which can be used at any time to fix a flat tire. Once the spare tire is used, a new one can be loaded into the car; however, only some locations have an available new tire to be loaded. The actions load-tire and change-tire are deterministic.

Figure 4.4: Example of why the definition of $S_{s,p}$ cannot be simplified. (b) $(s_0, 0.8)$-trajectory-based short-sighted SSP associated with the SSP in (a). (c) Ill-defined SSP obtained when $S' = \{\hat{s} \in S \mid P_{\max}(s_0, \hat{s}) \ge 0.8\} = \{s_0, s_H\}$: the state $s_L$ is reachable; however, $s_L \notin S'$.

The roads between locations are one-way only and the roadmap is represented as a directed graph in the shape of an equilateral triangle. Each problem in the triangle tire world is represented by a number $n \in \mathbb{N}^*$ corresponding to the roadmap size. Figure 4.5(a) illustrates the roadmap for the problems 1, 2 and 3 of the triangle tire world. The initial and goal locations, $l_0$ and $l_G$ respectively, are in two different vertices of the roadmap and their configuration is such that:

•  the shortest path policy from $l_0$ to $l_G$ has probability $0.5^{2n-1}$ of reaching the goal; and

•  the only proper policy, and therefore the optimal policy, is the policy that takes the longest path.

Moreover, every triangle tire world problem is a probabilistically interesting problem [Little and Thiebaux, 2007] because only the optimal policy reaches the goal with probability 1. This property is illustrated by the shades of gray in Figure 4.5(a), which represent, for each location $l$, $\max_\pi P$(car reaches $l$ and the tire is not flat when following the policy $\pi$ from $s_0$). Figure 4.5(b) shows the size of the state space $S$ and $|S^{\pi^*}|$, i.e., the number of states reachable from $s_0$ when following the optimal policy $\pi^*$, for problems up to $n = 60$.

Since the only proper policy is not complete, Assumption 2.1 does not hold for the triangle tire world problems, i.e., they contain avoidable dead ends. All dead ends of triangle tire world problems are states in which the tire is flat and there is no spare tire. Since the car cannot move when the tire is flat, these dead ends are states in which no action is available. Therefore, planners can trivially detect when a dead end $s_d$ is reached, in which case $V(s_d)$ is updated to infinity. In practice, the value assigned to $V(s_d)$ can be any value larger than $12n$ because $V^*(s_0) < 12n$ for the triangle tire world problem of size $n$.

Figure 4.5: Map and state space statistics of the triangle tire world. (a) Roadmap of the triangle tire world for the sizes 1, 2 and 3. Circles (squares) represent locations in which there is one (no) spare tire. In the initial state the car is at $l_0$ and the tire is not flat; the goal is to reach location $l_G$. The shades of gray represent, for each location $l$, $\max_\pi P$(car reaches $l$ and the tire is not flat when following the policy $\pi$ from $s_0$). (b) Log-lin plot of the state space size ($|S|$) and the size of the states reachable from $s_0$ when following the optimal policy $\pi^*$ ($|S^{\pi^*}|$) versus the number of the triangle tire world problem.

1  Run-SSiPP(SSP $\mathbb{S} = (S, s_0, G, A, P, C)$, $t \in \mathbb{N}^*$, $H$ a heuristic for $V^*$, $\epsilon > 0$)
2  begin
3      $V \leftarrow$ Value function for $S$ with default value given by $H$
4      $g \leftarrow 0$
5      for $i \in \{1, \dots, k\}$ do
6          $V \leftarrow$ SSiPP($\mathbb{S}$, $t$, $V$, $\epsilon$)
7          if SSiPP reached the goal then
8              $g \leftarrow g + 1$
9      return $g$

Algorithm 4.1: Algorithm to run SSiPP (Algorithm 3.2) $k$ times reusing the inferred bound $V$.

Next, we compare SSiPP using depth-based and trajectory-based short-sighted SSPs in order to solve triangle tire world problems. Up to this point, we have not proved that SSiPP terminates (or converges) when trajectory-based short-sighted SSPs are used instead of depth-based short-sighted SSPs. In Section 4.3, we prove that SSiPP terminates and is optimal for a class of short-sighted SSPs that includes trajectory-based short-sighted SSPs.

Due to the large size of $S^{\pi^*}$ (Figure 4.5(b)), it is infeasible to run SSiPP until $\epsilon$-convergence (Algorithm 3.3). Thus, we evaluate depth-based and trajectory-based short-sighted SSPs using Algorithm 4.1 for $k = 50$, i.e., we run SSiPP 50 times reusing the inferred lower bound (Line 6). Our evaluation metric is the value returned by Algorithm 4.1, i.e., the number of iterations of SSiPP that reached the goal. Because of the dead ends, not all executions of SSiPP might reach the goal, thus the performance of each planner is a number between 0 and 50.
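
The evaluation loop of Algorithm 4.1 can be sketched as follows. This is a minimal sketch under stated assumptions: `run_ssipp` and `fake_round` are hypothetical names, and `fake_round` is only a stand-in for one execution of SSiPP that returns the updated bound together with a flag telling whether the goal was reached. It does not model the triangle tire world.

```python
def run_ssipp(ssipp_round, V, k):
    """Run SSiPP k times, reusing the bound V, and count the runs that reach the goal."""
    goals_reached = 0
    for _ in range(k):
        V, reached_goal = ssipp_round(V)   # the bound V is carried across runs
        if reached_goal:
            goals_reached += 1
    return goals_reached


def fake_round(V):
    """Stand-in for SSiPP: falls in a dead end during its first two rounds only."""
    rounds = V.get("rounds", 0) + 1
    return {**V, "rounds": rounds}, rounds > 2


print(run_ssipp(fake_round, {}, k=50))
```

The point of the design is that the value function is threaded through the loop, so that knowledge acquired in a failed run (e.g., a detected dead end) is available to all later runs.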

Table 4.1 presents the average of 10 runs of Algorithm 4.1 for depth-based and trajectory-based short-sighted SSPs. We used the zero heuristic for both models, $t = 8$ for depth-based short-sighted SSPs, and $p \in \{0.125, 0.25, 0.5\}$ for trajectory-based short-sighted SSPs. For trajectory-based short-sighted SSPs, we also considered an exploration budget approach, i.e., we fix the total number of states in $S_{s,p}$ to be approximately the same as in the depth-based short-sighted SSP for $t = 8$ and state $s$. Formally, before $\mathbb{S}_{s,p}$ is computed in Algorithm 3.2 Line 6, we compute the state space $\hat{S}$ of the $(s, 8)$-depth-based short-sighted SSP and choose $p = \mathrm{argmax}_p \{|S_{s,p}| \text{ s.t. } |S_{s,p}| \le |\hat{S}|\}$. Since $\hat{S}$ depends on the current state $s$, the value of $p$ might differ for each $\mathbb{S}_{s,p}$ generated to solve a given SSP.
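
One way to realize the exploration budget is sketched below. It is an assumption-laden illustration rather than the procedure used in the experiments: the function name is hypothetical, $P_{\max}(s, \cdot)$ is assumed to be precomputed, and the example data reuse the probabilities assumed for the SSP of Figure 4.1 (0.75 on the bottom chain, 0.4 on the top chain). Since $S_{s,p}$ only grows as $p$ decreases, it suffices to scan the distinct values of $P_{\max}$ in decreasing order and stop at the first one that exceeds the budget.

```python
def budgeted_threshold(pmax, successors, budget):
    """Smallest threshold p (largest S_{s,p}) whose state count fits the budget.

    pmax maps each state u to P_max(s, u); successors[u] lists the states
    reachable from u with non-zero probability under some action.
    """
    best = None
    for p in sorted(set(pmax.values()), reverse=True):
        expanded = [u for u, pu in pmax.items() if pu >= p]
        size = len({v for u in expanded for v in successors[u]})
        if size > budget:
            break                       # smaller thresholds only add states
        best = p
    return best


# Assumed values for the SSP of Figure 4.1, with s = s0.
pmax = {"s0": 1.0, "s1": 0.4, "s2": 0.16, "s3": 0.064,
        "s1'": 0.75, "s2'": 0.5625, "s3'": 0.421875, "sG": 0.31640625}
successors = {"s0": ["s0", "s1", "s1'"], "s1": ["s1", "s2"], "s2": ["s2", "s3"],
              "s3": ["s3", "sG"], "s1'": ["s1'", "s2'"], "s2'": ["s2'", "s3'"],
              "s3'": ["s3'", "sG"], "sG": []}

# Budget of 5 states, the size of the (s0, 2)-depth-based state space.
print(budgeted_threshold(pmax, successors, budget=5))
```

With the same number of states as the depth-based model for $t = 2$, the trajectory-based model in this example spends its budget along the more probable bottom chain.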

                        Triangle Tireworld Problem Number
Short-Sighted Model       5     10    15    20    25    30    35    40    45    50    55    60
Depth t = 8               44.6  43.3  43.1  43.3  43.7  43.7  42.9  42.5  42.1  37.8  16.3  -
Trajectory w. budget      50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0
Trajectory p = 0.50       50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0
Trajectory p = 0.25       48.6  47.3  45.4  44.6  44.6  45.1  44.1  44.9  44.2  43.9  43.8  43.4
Trajectory p = 0.125      50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0

Table 4.1: Performance comparison between depth-based and trajectory-based short-sighted SSPs for the triangle tire world. Each value represents the average over 10 runs of Algorithm 4.1. For depth-based short-sighted SSPs, the parameter $t$ equals 8; for trajectory-based short-sighted SSPs, different values of $p$ and a budget approach are considered. The 95% confidence interval is less than 2.0 in all the obtained results, except for depth-based in problem 55, in which case it is 6.29. The best result for every problem is 50.0.

All the parametrizations of SSiPP using trajectory-based short-sighted SSPs outperform SSiPP using depth-based short-sighted SSPs. SSiPP using trajectory-based short-sighted SSPs and $p \in \{0.5, 0.125\}$ is especially noteworthy because it achieves the perfect score in all problems, i.e., it reaches a goal state in all the 50 iterations in all the 10 runs for all the problems. This interesting behavior of SSiPP using trajectory-based short-sighted SSPs for the triangle tire world can be explained by the following theorem:

Theorem 4.1. For the triangle tire world, SSiPP using trajectory-based short-sighted SSPs and an admissible heuristic never falls in a dead-end for $p \in (0.5^{i+1}, 0.5^i]$ and $i \in \{1, 3, 5, \dots\}$.

Proof. The optimal policy for the triangle tire world is to follow the longest path: move from the initial location $l_0$ to the goal location $l_G$ passing through location $l_c$, where $l_0$, $l_c$ and $l_G$ are the vertices of the triangle formed by the problem’s roadmap (Figure 4.5(a)). The path from $l_c$ to $l_G$ is unique, i.e., there is only one applicable move-car action for all the locations in this path. Therefore, all the decision making to find the optimal policy happens between the locations $l_0$ and $l_c$. Each location $l'$ in the path from $l_0$ to $l_c$ has either two or three applicable move-car actions and we refer to the set of locations $l'$ with three applicable move-car actions as $N$.

Every location $l' \in N$ is reachable from $l_0$ by applying an even number of move-car actions and the three applicable move-car actions in $l'$ are: (i) the optimal action $a_c$, i.e., move the car towards $l_c$; (ii) the action $a_G$ that moves the car towards $l_G$; and (iii) the action $a_p$ that moves the car parallel to the shortest path from $l_0$ to $l_G$. The location reached by $a_p$ does not have a spare tire, therefore $a_p$ is never selected since it reaches a dead-end with probability 0.5. The locations reached by applying either $a_c$ or $a_G$ have a spare tire and the greedy choice between them depends on the admissible heuristic used, thus $a_G$ might be selected instead of $a_c$. However, after applying $a_G$, only one move-car action $a$ is available and it reaches a location that does not have a spare tire. Therefore, the greedy choice between $a_c$ and $a_G$ considering two or more move-car actions is optimal under any admissible heuristic: every sequence of actions $(a_G, a, \dots)$ reaches a dead-end with probability at least 0.5 and at least one sequence of actions starting with $a_c$ has probability 0 of reaching a dead-end, e.g., the optimal solution.
Given p, we denote as L_{s,p} the set of all locations corresponding to states in S_{s,p} and as ls
the location corresponding to the state s. Thus, L_{s,p} contains all the locations reachable from ls
using up to m = ⌊log_{0.5} p⌋ + 1 move-car actions. If m is even and ls ∈ N, then every location
in L_{s,p} ∩ N represents a state either in G_{s,p} or at least two move-car actions away from any state
in G_{s,p}. Therefore the solution of the (s, p)-trajectory-based short-sighted SSP only chooses the
action ac to move the car. Also, since m is even, every state s used by SSiPP for generating
(s, p)-trajectory-based short-sighted SSPs has ls ∈ N. Therefore, for even values of m, i.e., for
p ∈ (0.5^{i+1}, 0.5^i] and i ∈ {1, 3, 5, ...}, SSiPP using trajectory-based short-sighted SSPs and p
always chooses the action ac to move the car to lc, thus avoiding all the dead-ends.

□ 


4.2  Greedy  Short-Sighted  SSPs 

To motivate the definition of greedy short-sighted SSPs, consider the SSP shown in Figure 4.6(a).
In this example, the state space represents a full binary tree of depth 3, with nodes labeled from
1 to 15, augmented with a special state r. The initial state is s0 = 1, i.e., the root of the binary
tree, and the goal is to reach the leaf represented by node 13, i.e., G = {13}. Three actions are
available in every non-leaf node of the binary tree: left, right and random. The action random
has cost 1 and moves to the left (right) branch of the tree with probability 0.5. The action left
(right) has cost 5 and moves to the left (right) branch of the tree with probability 0.9; with
probability 0.1, left (right) fails and moves to the right (left) branch of the tree.

In every leaf of the binary tree other than the goal leaf 13, the action restart is available
and it deterministically transitions to the state r. In state r, the action restart deterministically
moves to the root node of the binary tree. The action restart has cost 1 when applied on a tree
leaf or on r. Therefore, if the goal leaf 13 is not reached, the agent restarts the search from the
root node 1; this process is repeated until the goal leaf is reached.

Figure 4.6(b) shows the (s0, 2)-depth-based short-sighted SSP associated with the SSP in
Figure 4.6(a); this depth-based short-sighted SSP is equivalent to the (s0, p)-trajectory-based
short-sighted SSP for p ∈ (0.9^2, 0.9]. Notice that the state space and goal set of the short-sighted
SSP in Figure 4.6(b) are the same regardless of the heuristic H used as parameter, e.g., the
zero heuristic.

The state space and goal set are independent of the heuristic H in depth-based and
trajectory-based short-sighted SSPs because H is used only for incrementing the
cost of reaching artificial goals. In the next section, we introduce greedy short-sighted SSPs, a
new short-sighted model that prunes states based on their heuristic cost of reaching the goal.


Figure 4.6: Example of an SSP to motivate the definition of greedy short-sighted SSPs. (a)
Example of an SSP. The initial state s0 is the node 1, the goal set is G = {13}. Actions random
and restart cost 1 and are represented by solid black and dashed black arrows respectively.
Actions left (green arrows) and right (blue arrows) cost 5 and succeed with probability 0.9.
left (right) fails with probability 0.1 by moving to the right (left) branch of the tree (this effect
is omitted in the picture for ease of presentation). (b) (s0, 2)-depth-based and (s0, p)-trajectory-based,
for p ∈ (0.9^2, 0.9], short-sighted SSP associated with the SSP in (a).

4.2.1  Definition 

Algorithm 4.2 presents our approach to generate a short-sighted state space that takes into
account a given heuristic H. This algorithm performs a best-first search from the given state s
using as node expansion criterion the fringe node s' that minimizes H(s')/Pmax(s, s'), i.e., the
heuristic value of s' divided by the maximum trajectory probability between s and s'
(Definition 4.1). The search fringe is stored in the priority queue Q, in which the next state to be popped
minimizes the expansion criterion, and Q is initialized with the input state s (Lines 3 to 6). Once
a state ŝ ∉ G is popped from Q (Line 10), then: (i) ŝ is removed from the short-sighted SSP goal
set G_{s,k} (Line 14), i.e., ŝ is not considered as an artificial goal anymore; and (ii) ŝ is expanded
(Lines 15 to 19), i.e., all states s' such that there exists a ∈ A with P(s'|ŝ, a) > 0 are added to Q.

The search performed by Algorithm 4.2 terminates once S_{s,k} contains k or more states
(Line 8). In order to guarantee that all effects of actions applied to states in S_{s,k} \ G_{s,k}
belong to S_{s,k} (Figure 4.4), Algorithm 4.2 might increase the size of S_{s,k} beyond k by adding more
states, all of them as artificial goals. Therefore, we have that |S_{s,k} \ G_{s,k}| < k, since |G_{s,k}| > 0.

Definition 4.3 formalizes the (s, k)-greedy short-sighted SSPs, where k is the size of the
generated short-sighted state space.
 1  Generate-Greedy-Space(SSP 𝕊 = (S, s0, G, A, P, C), s ∈ S, k ∈ N*, H a heuristic for V*)
 2  begin
 3      Q ← Empty-Smallest-First-Priority-Queue
 4      S_{s,k} ← {s}
 5      G_{s,k} ← {s}
 6      Q.Insert(0, s)
 7      while not Q.isEmpty() do
 8          if |S_{s,k}| ≥ k then
 9              break
10          ŝ ← Q.Pop()
11          if ŝ ∈ G then
12              continue
13          else
14              G_{s,k} ← G_{s,k} \ {ŝ}
15          foreach a ∈ A and s' ∈ S s.t. P(s'|ŝ, a) > 0 do
16              if s' ∉ S_{s,k} then
17                  S_{s,k} ← S_{s,k} ∪ {s'}
18                  G_{s,k} ← G_{s,k} ∪ {s'}
19                  Q.Insert(H(s')/Pmax(s, s'), s')
20      return (S_{s,k}, G_{s,k})

Algorithm 4.2: Algorithm to generate the state space and goal set for greedy short-sighted
SSPs.


Definition 4.3 (Greedy Short-Sighted SSP). Given an SSP 𝕊 = (S, s0, G, A, P, C), a state s ∈ S,
k ∈ N* and a heuristic H, the (s, k)-greedy short-sighted SSP 𝕊_{s,k} = (S_{s,k}, s, G_{s,k}, A, P, C_{s,k})
associated with 𝕊 is defined as:

• S_{s,k} and G_{s,k} are the returned values of Generate-Greedy-Space(𝕊, s, k, H)
(Algorithm 4.2); and

• the cost function C_{s,k} is given by

\[
C_{s,k}(s', a, s'') =
\begin{cases}
C(s', a, s'') + H(s'') & \text{if } s'' \in G_{s,k} \\
C(s', a, s'') & \text{otherwise}
\end{cases}
\qquad \forall s' \in S_{s,k},\ a \in A,\ s'' \in S_{s,k}
\]

For simplicity, when H is not clear by context nor explicit, then H(s) = 0 for all s ∈ S.
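
The best-first search of Algorithm 4.2 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not part of the text: transitions are stored as `P[state][action] = [(probability, next_state), ...]`, Pmax(s, ·) is precomputed over the whole (small, enumerable) state space with a max-product variant of Dijkstra's algorithm, and ties between equal keys are broken by insertion order. All function names are hypothetical. The example applies the sketch to the depth-3 binary tree of Figure 4.6(a), once with the minimum number of actions to node 13 as heuristic and once with the zero heuristic.

```python
import heapq
from itertools import count


def max_trajectory_prob(P, s):
    """Pmax(s, .) via a max-product variant of Dijkstra's algorithm (assumed helper)."""
    best, done, tie = {s: 1.0}, set(), count()
    heap = [(-1.0, next(tie), s)]
    while heap:
        neg_p, _, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for outcomes in P[u].values():
            for prob, v in outcomes:
                cand = -neg_p * prob
                if cand > best.get(v, 0.0):
                    best[v] = cand
                    heapq.heappush(heap, (-cand, next(tie), v))
    return best


def generate_greedy_space(P, G, s, k, H):
    """Sketch of Algorithm 4.2; returns the pair (S_sk, G_sk)."""
    pmax = max_trajectory_prob(P, s)
    tie = count()  # assumption: insertion order breaks ties between equal keys
    S_sk, G_sk = {s}, {s}
    queue = [(0.0, next(tie), s)]
    while queue:
        if len(S_sk) >= k:
            break
        _, _, u = heapq.heappop(queue)
        if u in G:
            continue              # original goals are never expanded
        G_sk.discard(u)           # u is expanded, so it is no longer an artificial goal
        for outcomes in P[u].values():
            for prob, v in outcomes:
                if prob > 0 and v not in S_sk:
                    S_sk.add(v)
                    G_sk.add(v)
                    heapq.heappush(queue, (H(v) / pmax[v], next(tie), v))
    return S_sk, G_sk


def min_actions_to(P, target):
    """Fewest actions that can reach target with positive probability (level-wise search)."""
    dist, d, changed = {target: 0}, 0, True
    while changed:
        changed = False
        for u in P:
            if u not in dist and any(
                dist.get(v) == d for outs in P[u].values() for p, v in outs if p > 0
            ):
                dist[u] = d + 1
                changed = True
        d += 1
    return dist


def binary_tree_depth3():
    """The SSP of Figure 4.6(a): nodes 1..15 plus the restart state 'r'; goal is 13."""
    P = {}
    for i in range(1, 8):
        l, r = 2 * i, 2 * i + 1
        P[i] = {"random": [(0.5, l), (0.5, r)],
                "left": [(0.9, l), (0.1, r)],
                "right": [(0.1, l), (0.9, r)]}
    for i in range(8, 16):
        P[i] = {} if i == 13 else {"restart": [(1.0, "r")]}
    P["r"] = {"restart": [(1.0, 1)]}
    return P


tree = binary_tree_depth3()
delta = min_actions_to(tree, 13)
S_informed, G_informed = generate_greedy_space(tree, {13}, 1, 7, lambda v: delta[v])
S_zero, G_zero = generate_greedy_space(tree, {13}, 1, 7, lambda v: 0.0)
print(sorted(S_informed), sorted(G_informed))
```

With the informed heuristic the generated space follows the branch towards node 13, whereas with the zero heuristic the outcome depends entirely on the tie-breaking rule; the insertion-order rule assumed here yields the first two levels below the root.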

Figure 4.7 shows two (s0, 7)-greedy short-sighted SSPs associated with the SSP in
Figure 4.6(a) when using the zero heuristic. Due to ties in the zero heuristic (H0(s) = 0 for all
s ∈ S), five greedy short-sighted SSPs from s0 using k = 7 are possible: one for each branch
containing one pair of leaves, e.g., Figures 4.7(a) and 4.7(b), and the greedy short-sighted SSP
equivalent to the depth-based short-sighted SSP for t = 2 (Figure 4.6(b)). Notice that the greedy
short-sighted SSP in Figure 4.7(b) contains the original goal, i.e., the tree leaf labeled 13.

Figure 4.7: Examples of (s0, 7)-greedy short-sighted SSPs for the SSP in Figure 4.6. (a)
and (b) are two of the five possible (s0, 7)-greedy short-sighted SSPs when the zero heuristic
is used. (b) is also the unique (s0, 7)-greedy short-sighted SSP obtained when the heuristic
H'(s) = δ(s, 13) (Definition 3.1 p.21) is used.

To further illustrate the advantages of greedy short-sighted SSPs, consider the heuristic H'
defined as the minimum number of actions from s to the goal set. Formally, H'(s) = δ(s, 13)
(Definition 3.1 p.21). Using H' as heuristic, the (s0, 7)-greedy short-sighted SSP associated with
the SSP in Figure 4.6(a) is unique and is depicted in Figure 4.7(b). This example shows how
greedy short-sighted SSPs can take advantage of informed heuristics to generate state spaces
biased towards the goal set of the original problem.

4.2.2  The  n-Binary  Tree  Problem 

In this section, we generalize the binary search problem in Figure 4.6 to full binary trees of any
depth n (Example 4.1). Then we present an empirical comparison between SSiPP using depth-based
and greedy short-sighted SSPs in the n-binary tree problems for different parameters and
values of n.

Example 4.1 (n-binary tree). Given n ∈ N*, the n-binary tree problem contains 2^{n+1} states:
S = {1, 2, ..., 2^{n+1} − 1, r}. The initial state s0 is 1 and the goal set G is the singleton set {sG},
where

\[
s_G =
\begin{cases}
2^n + \sum_{i=0}^{\lfloor n/2 \rfloor} 2^{2i} & \text{if } n \text{ is odd} \\
2^n + \sum_{i=1}^{n/2} 2^{2i-1} & \text{if } n \text{ is even}
\end{cases}
\]

For the states i ∈ {1, ..., 2^n − 1}, three actions are available: random, left and right. The
probability of reaching the state 2i is, respectively, 0.5, 0.9 and 0.1 for random, left and
right; and with probability 0.5, 0.1 and 0.9, the state 2i + 1 is reached using random, left
and right, respectively. In the states i ∈ {2^n, ..., 2^{n+1} − 1} \ G, the only available action
is restart and P(r|i, restart) = 1. restart is also the only available action in state r and
P(1|r, restart) = 1. Actions random and restart have cost 1; and actions left and right
have cost 5.
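
As an illustration of Example 4.1, the following Python sketch builds the n-binary tree problem explicitly. The representation `P[state][action] = [(probability, next_state, cost), ...]` and the function names are assumptions made for this sketch only; the summation bounds used in `goal_state` are those that reproduce the goal leaf 13 of Figure 4.6 for n = 3.

```python
def goal_state(n):
    """Goal leaf s_G of the n-binary tree problem."""
    if n % 2 == 1:
        return 2**n + sum(2**(2 * i) for i in range(0, n // 2 + 1))
    return 2**n + sum(2**(2 * i - 1) for i in range(1, n // 2 + 1))


def build_n_binary_tree(n):
    """Return (states, goal, P) for the n-binary tree problem."""
    goal = goal_state(n)
    states = list(range(1, 2**(n + 1))) + ["r"]
    P = {}
    for i in range(1, 2**n):  # internal nodes: random, left and right
        P[i] = {
            "random": [(0.5, 2 * i, 1), (0.5, 2 * i + 1, 1)],
            "left": [(0.9, 2 * i, 5), (0.1, 2 * i + 1, 5)],
            "right": [(0.1, 2 * i, 5), (0.9, 2 * i + 1, 5)],
        }
    for i in range(2**n, 2**(n + 1)):  # leaves: only restart, except at the goal
        P[i] = {} if i == goal else {"restart": [(1.0, "r", 1)]}
    P["r"] = {"restart": [(1.0, 1, 1)]}
    return states, goal, P


states, goal, P = build_n_binary_tree(3)
print(len(states), goal)
```

For n = 3 this reproduces the SSP of Figure 4.6(a), with 16 states and goal leaf 13.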

We empirically compare depth-based and greedy short-sighted SSPs by running SSiPP
(Algorithm 3.2) using both definitions of short-sighted SSPs. The heuristic used in this experiment
is the zero heuristic and, for depth-based short-sighted SSPs, we use t ∈ {3, 4}. For greedy
short-sighted SSPs, we choose the value of k based on the number of states used by the depth-based
short-sighted SSPs. Formally, before 𝕊_{s,k} is computed in Algorithm 3.2 Line 6, we compute
the state space S_{s,t} of the (s, t)-depth-based short-sighted SSP and use k = |S_{s,t}|. We refer
to this parametrization of greedy short-sighted SSPs as "budget t" and, in this experiment, we
consider two budget parametrizations: budget t = 3 and budget t = 4. Trajectory-based
short-sighted SSPs are not considered because of the following equivalence between them and depth-based
short-sighted SSPs in the n-binary tree problems: for all p ∈ (0, 1], the trajectory-based
short-sighted SSP using p as parameter is equivalent to the depth-based short-sighted SSP using
t = ⌊log_{0.9} p⌋ + 1.
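
The correspondence between the probability threshold p and the depth t can be checked numerically. The helper below is a hypothetical illustration, not part of SSiPP; its second argument is the largest single-step probability of the domain (0.9 in the n-binary tree problems). Because it relies on floating-point logarithms, values of p that are exact powers of the base may be rounded to either side of the boundary.

```python
import math


def equivalent_depth(p, max_step_prob=0.9):
    """Depth t = floor(log_{max_step_prob}(p)) + 1 for a threshold p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.floor(math.log(p) / math.log(max_step_prob)) + 1


print(equivalent_depth(0.85))  # a threshold strictly between 0.9**2 and 0.9
```

For instance, any p strictly between 0.9^2 and 0.9 maps to t = 2, which matches the equivalence stated for Figure 4.6(b).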

Figure 4.8 presents the results of this experiment as average and 95% confidence interval
over 100 samples for the accumulated cost to reach the goal in the n-binary tree problems. Both
parametrizations of SSiPP using greedy short-sighted SSPs outperform SSiPP using depth-based
SSPs. In particular, the budget t = 3 parametrization of greedy short-sighted SSPs outperforms
the parametrization t = 4 of depth-based short-sighted SSPs.

Notice that the zero heuristic does not favor greedy short-sighted SSPs since this heuristic
provides no information about the goal. However, SSiPP improves its current lower bound V
every time a short-sighted SSP is solved and uses the improved V as heuristic for the subsequent
short-sighted SSPs (Algorithm 3.2 Lines 6 and 8). Therefore, as the execution of SSiPP evolves,
the greedy short-sighted SSPs are able to take advantage of the improved lower bound V in order
to bias the short-sighted state space S_{s,k} towards the goals of the original problem.


4.3  Extending  SSiPP  to  General  Short-Sighted  Models 

In Section 3.3, we proved that SSiPP (Algorithm 3.2) always terminates and is asymptotically
optimal for depth-based short-sighted SSPs. We generalize these results regarding SSiPP by:
(i) providing the sufficient conditions for the generation of short-sighted problems (Algorithm 3.2
Line 6) in Definition 4.4; and (ii) proving that SSiPP implicitly performs Bellman backups
(Theorem 4.2), and always terminates (Theorem 4.3) when the short-sighted SSP generator respects
Definition 4.4. The proof that SSiPP is asymptotically optimal (Theorem 3.9) automatically
follows since it relies only on the fact that SSiPP terminates and performs Bellman updates.

Figure 4.8: Results for the binary-tree domain experiment. Each point represents the average
and 95% confidence interval over 100 samples for the accumulated cost to reach the goal in the
binary tree problems.

Definition 4.4. Given an SSP (S, s0, G, A, P, C), the sufficient conditions on the short-sighted
SSPs (S', s, G', A, P', C') returned by the generator in Algorithm 3.2 Line 6 are:

1. G ∩ S' ⊆ G';

2. s ∉ G → s ∉ G'; and

3. for all ŝ ∈ S' \ G', s' ∈ S and a ∈ A, if P(s'|ŝ, a) > 0, then s' ∈ S' and P'(s'|ŝ, a) = P(s'|ŝ, a).

Item 3 of Definition 4.4 guarantees that, if a state ŝ is in the short-sighted SSP and is not a
goal, i.e., ŝ ∈ S' \ G', then the resulting states of all applicable actions on ŝ are also in S'
(Figure 4.4) and they are reachable with the same probability as in the original SSP. Notice that,
by definition, depth-based, trajectory-based and greedy short-sighted SSPs meet the sufficient
conditions presented in Definition 4.4.
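
The three conditions of Definition 4.4 are simple enough to check mechanically. The sketch below is a hypothetical validator, assuming transitions stored as `P[state][action] = {next_state: probability}` and sets for S', G and G'; it compares probabilities with exact equality, which is only appropriate when P' is copied from P.

```python
def satisfies_definition_4_4(P, G, s, S_prime, G_prime, P_prime):
    """Check the three sufficient conditions for a short-sighted SSP rooted at s."""
    # Condition 1: every original goal kept in S' is still a goal.
    if not (G & S_prime) <= G_prime:
        return False
    # Condition 2: a non-goal root must not become an artificial goal.
    if s not in G and s in G_prime:
        return False
    # Condition 3: non-goal states keep all their successors, with equal probabilities.
    for u in S_prime - G_prime:
        for a, outcomes in P[u].items():
            for v, prob in outcomes.items():
                if prob > 0:
                    if v not in S_prime:
                        return False
                    if P_prime.get(u, {}).get(a, {}).get(v, 0.0) != prob:
                        return False
    return True


# Tiny illustrative SSP: 1 branches to 2 or 3, and both lead to the goal 4.
P = {1: {"random": {2: 0.5, 3: 0.5}}, 2: {"go": {4: 1.0}}, 3: {"go": {4: 1.0}}, 4: {}}
G = {4}
ok = satisfies_definition_4_4(P, G, 1, {1, 2, 3}, {2, 3}, P)
missing_successor = satisfies_definition_4_4(P, G, 1, {1, 2}, {2}, P)
root_is_goal = satisfies_definition_4_4(P, G, 1, {1, 2, 3}, {1, 2, 3}, P)
print(ok, missing_successor, root_is_goal)
```

The second call fails item 3 because successor 3 of the non-goal state 1 was pruned, and the third fails item 2 because the root was turned into an artificial goal.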

Theorem 4.2. Given an SSP 𝕊 = (S, s0, G, A, P, C) such that Assumption 2.1 holds, a monotonic
lower bound H for V*, and a short-sighted SSP generator that respects Definition 4.4, then
the loop in Line 8 of SSiPP (Algorithm 3.2) is equivalent to applying at least one Bellman backup
on V for every state s' ∈ S^{π*_{𝕊'}} \ G', where 𝕊' = (S', s, G', A, P', C') is the short-sighted
SSP generated on Line 6.

Proof. Let U denote S^{π*_{𝕊'}} \ G'. After the loop in Line 8 of Algorithm 3.2 is executed, we have that,
for all s' ∈ U, V(s') equals V*_{𝕊'}(s'). By item 1 of Definition 4.4, we have U ∩ G = ∅, therefore
V(sG) remains equal to 0 for all sG ∈ G. Thus, we need to prove that (BV)(s') ≤ V*_{𝕊'}(s') for all
s' ∈ U, since V is monotonic and admissible (Theorem 2.1). By item 3 of Definition 4.4, every
state s' ∈ U is such that {s'' ∈ S | P(s''|s', a) > 0} ⊆ S' for all a ∈ A. Item 3 also guarantees that
P'(·|s', a) = P(·|s', a) for all s' ∈ U and a ∈ A, therefore (BV)(s') = (B'V)(s') for all s' ∈ U,
where B' is the Bellman operator B applied in the short-sighted SSP 𝕊'. Since V is monotonic
and admissible, (B'V)(s') ≤ V*_{𝕊'}(s'). Therefore, (BV)(s') ≤ V*_{𝕊'}(s') for all s' ∈ U. □

Theorem 4.3. SSiPP always terminates under the same conditions as Theorem 4.2.

Proof. By Assumption 2.1 there are no dead ends in 𝕊, thus ε-Optimal-SSP-Solver always
terminates. Since the short-sighted SSP 𝕊' is an SSP by definition, a goal state sG ∈ G' of
𝕊' is always reached, therefore the loop in Line 11 of Algorithm 3.2 also always terminates. If
sG is a goal of the original SSP, i.e., sG ∈ G, then SSiPP terminates in this iteration. Otherwise,
sG ∈ G' \ G and sG ≠ s by item 2 of Definition 4.4, i.e., sG differs from the state s used as
initial state for the short-sighted SSP generation. Thus another iteration of SSiPP is performed
using sG as s in the generation of a new short-sighted SSP (Line 6). Suppose, for the purpose of
contradiction, that every goal state reached during the execution of SSiPP is an artificial goal, i.e.,
SSiPP does not terminate. Then infinitely many short-sighted SSPs are solved. Since S is finite,
there exists s ∈ S that is updated infinitely often, therefore V(s) → ∞. However, V*(s) < ∞ by
Assumption 2.1. Since SSiPP performs Bellman updates (Theorem 4.2), V(s) ≤ V*(s) by
monotonicity of Bellman updates (Theorem 2.1) and admissibility of H, a contradiction. Thus
every execution of SSiPP reaches a goal state sG ∈ G and therefore terminates. □


4.4  Summary 

In this chapter, we introduced trajectory-based short-sighted SSPs and greedy short-sighted SSPs.
Trajectory-based short-sighted SSPs prune the states for which every trajectory that reaches them has
probability less than a given threshold p. Greedy short-sighted SSPs perform a best-first search in
the state space of the original problem using H(s')/Pmax(s, s') as evaluation function, where Pmax(s, s')
is the maximum probability of a trajectory from s (the initial state of the short-sighted SSP) to s';
the search stops when the search tree contains k or more states, and the visited states are used
as the short-sighted state space and the leaves as artificial goals. We also presented a set of
sufficient conditions for any short-sighted SSP definition under which SSiPP always terminates and
is asymptotically optimal.




Chapter  4:  General  Short-Sighted  Models 


Chapter  5 

Extending  SSiPP 


In this chapter, we examine how to combine SSiPP (Section 3.3) with commonly used probabilistic
planning techniques, e.g., labeling of converged states and determinizations [Trevizan
and Veloso, 2013]. We begin by adding a labeling mechanism to SSiPP in order to keep track
of states that already converged to their ε-optimal solution and avoid revisiting them. Next, in
Section 5.2, we extend SSiPP to multi-core processing by generating and solving multiple
short-sighted SSPs in parallel. In Section 5.3, we show how to combine SSiPP with determinizations
in order to compute sub-optimal solutions more efficiently. The empirical comparison between
SSiPP, the algorithms proposed in this chapter, and other state-of-the-art probabilistic planners
is presented in Chapter 7.


5.1  Labeled  SSiPP 

As described in Section 3.3, SSiPP obtains the next state s' from the current state s by either
executing or simulating the optimal policy π*_{𝕊_{s,t}} of the current short-sighted SSP 𝕊_{s,t}
(Algorithm 3.2 Line 11). This procedure is repeated until s' is a goal state, either from the original
SSP or an artificial goal of 𝕊_{s,t}.

RTDP (Section 2.3.1) employs a similar technique: the next state s' is obtained by either
executing or simulating π^V, i.e., the greedy action according to the current estimate V of V*.
This approach can be seen as an unbiased sampling of the next state; therefore, more likely
successor states are updated more often. However, the ε-convergence of a given state s depends
on all its reachable successors [Bonet and Geffner, 2003], thus unlikely successors should also be
visited. As a result, for a given state s, unbiased sampling might not update unlikely successors
of s frequently, thus delaying the overall ε-convergence to V*.
 1  CheckSolved(SSP 𝕊 = (S, s0, G, A, P, C), state s ∈ S, value function V, solved ⊆ S, ε > 0)
 2  begin
 3      conv ← true
 4      open ← Empty-Stack
 5      closed ← Empty-Stack
 6      if s ∉ solved then open.Push(s)
 7      while not open.isEmpty() do
 8          s ← open.Pop()
 9          closed.Push(s)
10          if s ∈ (G ∪ solved) then continue
11          if R(s, V) > ε then
12              conv ← false
13              continue
14          foreach s' s.t. P(s'|s, π^V(s)) > 0 do
15              if s' ∉ (solved ∪ open ∪ closed) then open.Push(s')
16      if conv = true then
17          foreach s' ∈ closed do
18              solved ← solved ∪ {s'}
19      else
20          while not closed.isEmpty() do
21              s ← closed.Pop()
22              V(s) ← (BV)(s)
23      return (solved, V)

Algorithm 5.1: CheckSolved algorithm used by Labeled RTDP [Bonet and Geffner, 2003].

Labeled RTDP (LRTDP) [Bonet and Geffner, 2003] extends RTDP by tracking the states
for which the estimate V of V* has already ε-converged and not visiting these states again. In
order to find and label the ε-converged states, the procedure CheckSolved (Algorithm 5.1)
is introduced. Given a state s, CheckSolved searches for states s' reachable from s when
following the greedy policy π^V such that R(s', V) > ε. If no such state s' is found, then s and all
the states in S^{π^V} reachable from s have ε-converged and they are labeled as solved. Alternatively,
if there exists s' reachable from s when following the greedy policy π^V such that R(s', V) > ε,
then a Bellman backup is applied on at least V(s'). A key property of CheckSolved is that, if
V has not ε-converged, then a call to CheckSolved either improves V or labels a new state as
solved; formally:

Theorem 5.1 ([Bonet and Geffner, 2003, Theorem 4]). Given an SSP 𝕊 = (S, s0, G, A, P, C)
that satisfies Assumption 2.1, ε > 0, and a monotonic lower bound V for V*, then a call of
CheckSolved(𝕊, s, V, solved, ε) for s ∉ solved, that returns (solved', V'), either: labels a
state as solved, i.e., |solved'| > |solved|; or, there exists s' ∈ S such that V'(s') − V(s') > ε.
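
To make the labeling procedure concrete, the following is a minimal Python sketch of CheckSolved (Algorithm 5.1). It rests on assumptions that are not in the text: transitions are stored as `P[state][action] = [(probability, next_state, cost), ...]`, the residual R(s, V) is taken to be |V(s) − (BV)(s)|, goal states are skipped in the final update loop since their value stays 0, and `V` and `solved` are modified in place. The helper names are hypothetical.

```python
def q_value(P, V, s, a):
    """Expected cost of applying a in s and then following V."""
    return sum(p * (c + V[s2]) for p, s2, c in P[s][a])


def bellman_backup(P, V, s):
    """(BV)(s): minimum Q-value over the actions applicable in s."""
    return min(q_value(P, V, s, a) for a in P[s])


def greedy_action(P, V, s):
    """Greedy action in s with respect to the current estimate V."""
    return min(P[s], key=lambda a: q_value(P, V, s, a))


def check_solved(P, G, s, V, solved, eps):
    """Sketch of CheckSolved; mutates and returns (solved, V)."""
    conv = True
    open_, closed = [], []
    if s not in solved:
        open_.append(s)
    while open_:
        s = open_.pop()
        closed.append(s)
        if s in G or s in solved:
            continue
        if abs(V[s] - bellman_backup(P, V, s)) > eps:   # residual R(s, V)
            conv = False
            continue
        for p, s2, _ in P[s][greedy_action(P, V, s)]:
            if p > 0 and s2 not in solved and s2 not in open_ and s2 not in closed:
                open_.append(s2)
    if conv:
        solved.update(closed)
    else:
        while closed:
            s = closed.pop()
            if s not in G:                               # goals keep value 0
                V[s] = bellman_backup(P, V, s)
    return solved, V


# Deterministic chain a -> b -> g with unit costs and the zero heuristic.
P = {"a": {"go": [(1.0, "b", 1.0)]}, "b": {"go": [(1.0, "g", 1.0)]}, "g": {}}
V = {"a": 0.0, "b": 0.0, "g": 0.0}
solved, calls = set(), 0
while "a" not in solved:
    solved, V = check_solved(P, {"g"}, "a", V, solved, 1e-6)
    calls += 1
print(calls, V, sorted(solved))
```

On this chain each of the first two calls raises the value of at least one state and the third call labels every visited state as solved, which illustrates the "improve or label" behaviour stated in Theorem 5.1.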

Using the solved labels, the sampling procedure of LRTDP can be seen as a case of rejection
sampling: if the sampled successor s' of s is marked as solved, restart the procedure from the
initial state s0, otherwise use s'. This new sampling procedure gives LRTDP both a better anytime
performance and a faster convergence to the ε-optimal solution when compared to RTDP.

Labeled-SSiPP (Algorithm 5.2) is an extension of SSiPP that incorporates the labeling
mechanism of LRTDP and uses the CheckSolved procedure. Since the states marked as solved have
already ε-converged, there is no need to further explore and update them; therefore the solved
states are also considered as artificial goals for the generated short-sighted SSPs (Algorithm 5.2
Line 10). By adding the solved states to the goal set of the generated short-sighted SSPs, any
algorithm used as ε-Optimal-SSP-Solver (Line 13) will implicitly take advantage of the
labeling mechanism, i.e., the search is stopped once a solved state is reached.

The simulation of the current short-sighted SSP (Algorithm 5.2 Line 16) for Labeled-SSiPP
finishes when the state s is either: (i) a goal state of the original problem; (ii) a solved state; or
(iii) an artificial goal. Only in the last case does the algorithm continue to generate short-sighted
SSPs. Thus, Labeled-SSiPP (as LRTDP) also employs rejection sampling: if a solved state is sampled,
then the search restarts from the initial state s0.

Besides the empirical advantage of LRTDP over RTDP [Bonet and Geffner, 2003], the
labeling mechanism also makes it possible to upper bound the maximum number of iterations
necessary for LRTDP to converge to the ε-optimal solution. This same upper bound holds for
Labeled-SSiPP:

Corollary 5.2. Given an SSP 𝕊 = (S, s0, G, A, P, C) that satisfies Assumption 2.1, ε > 0, t ∈ N*
and a monotonic heuristic H for V*, then Labeled-SSiPP (Algorithm 5.2) reaches ε-convergence
after at most ε^{-1} Σ_{s∈S} [V*(s) − H(s)] iterations of the loop in Line 5.

Proof. In each iteration of the loop in Line 5 of Algorithm 5.2, CheckSolved is called for
at least one state s ∉ solved, since s0 ∉ solved. By Theorem 5.1, after CheckSolved is
called for s, either: (i) s ∈ solved; or (ii) there exists s' ∉ solved reachable from s when
following the greedy policy π^V such that V(s') − V_old(s') > ε, where V_old denotes V before the
CheckSolved call. Thus, in the worst case, each CheckSolved call improves V for exactly
one state s' ∉ solved. Therefore, CheckSolved is called at most ε^{-1} Σ_{s∈S} [V*(s) − H(s)]
times before s0 ∈ solved, which is the termination condition for the loop in Line 5. □
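
As a small worked instance of the bound in Corollary 5.2, the hypothetical helper below evaluates ε^{-1} Σ_{s∈S} [V*(s) − H(s)] for value functions given as dictionaries. The numbers are illustrative only: a three-state chain with V* = (2, 1, 0), the zero heuristic and ε = 0.5.

```python
def labeled_ssipp_iteration_bound(V_star, H, eps):
    """Upper bound on outer-loop iterations: (1/eps) * sum over s of [V*(s) - H(s)]."""
    return sum(V_star[s] - H[s] for s in V_star) / eps


V_star = {"a": 2.0, "b": 1.0, "g": 0.0}
H = {"a": 0.0, "b": 0.0, "g": 0.0}
bound = labeled_ssipp_iteration_bound(V_star, H, 0.5)
print(bound)
```

A more informed admissible heuristic shrinks each term V*(s) − H(s) and therefore tightens the bound.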

We empirically compare the ε-convergence time of Labeled-SSiPP and other state-of-the-art
planners in Chapter 7. In the next section, we show how to extend Labeled-SSiPP to multi-core
computing by generating and solving multiple short-sighted SSPs in parallel.
 1  Labeled-SSiPP(SSP 𝕊 = (S, s0, G, A, P, C), t ∈ N*, H a heuristic for V*, ε > 0)
 2  begin
 3      V ← Value function for S with default value given by H
 4      solved ← ∅
 5      while s0 ∉ solved do
 6          s ← s0
 7          visited ← Empty-Stack
 8          while s ∉ (G ∪ solved) do
 9              𝕊_{s,t} ← Generate-Short-Sighted-SSP(𝕊, s, V, t)
10              foreach s' ∈ S_{s,t} do
11                  if s' ∈ solved then
12                      G_{s,t} ← G_{s,t} ∪ {s'}
13              (π*_{𝕊_{s,t}}, V*_{𝕊_{s,t}}) ← ε-Optimal-SSP-Solver(𝕊_{s,t}, V, ε)
14              foreach s' ∈ S^{π*_{𝕊_{s,t}}} \ G_{s,t} do
15                  V(s') ← V*_{𝕊_{s,t}}(s')
16              while s ∉ G_{s,t} do
17                  visited.Push(s)
18                  s ← Apply-Action(π*_{𝕊_{s,t}}(s), s)
19          while not visited.isEmpty() do
20              s ← visited.Pop()
21              (solved, V) ← CheckSolved(𝕊, s, V, solved, ε)
22              if s ∉ solved then
23                  break
24      return V

Algorithm 5.2: Labeled SSiPP: version of SSiPP that incorporates the LRTDP labeling
mechanism. CheckSolved is presented in Algorithm 5.1.


5.2  Parallel  Labeled  SSiPP


5.2  Parallel  Labeled  SSiPP 
Deterministic planners have benefited from parallelism to compute both optimal and suboptimal
solutions. Different approaches have been proposed, e.g., search space abstraction [Burns et al.,
2009, Burns et al., 2010, Zhou et al., 2010], hashing [Zhou and Hansen, 2007, Kishimoto et al.,
2009, Kishimoto et al., 2010], and parallel successor generation [Vidal et al., 2010, Sulewski
et al., 2011].

For discounted infinite-horizon MDPs (Section 2.1), i.e., probabilistic planning problems
without goal states, parallel solvers have been proposed [Archibald et al., 1993, Archibald et al.,
1995]. These planners extend asynchronous value iteration to perform updates in parallel.
Although these approaches can be applied to SSPs, they do not exploit the problem's structure,
e.g., the initial state and the set of goal states. Therefore, parallel MDP solvers always explore the
complete state space, including irrelevant states [Barto et al., 1995].

It is important to notice that finding the optimal solution of a deterministic planning problem and
of a probabilistic planning problem belongs, respectively, to the NC and P-complete complexity
classes in their enumerative representation [Papadimitriou and Tsitsiklis, 1987]. In other words,
the optimal solution of a deterministic planning problem can be efficiently found using a parallel
algorithm, while it is unlikely that optimal algorithms to solve probabilistic planning problems can
be efficiently parallelized.¹ However, problems in which a given set of states L must be visited in
order to reach the goal can take advantage of parallelization. To illustrate how parallelization can
speed up some probabilistic planning problems, consider the hallway problem:

Example 5.1 (Hallway problem). In the hallway problem, a robot has to navigate a grid composed
of k rooms of size r each while avoiding the hazard locations and walls. The rooms form
a line and each room is connected to the next by a single door. Figure 5.1 shows an example of a grid
for k = 3 and r = 5. Every time the robot enters a hazard location, it breaks with probability 0.9.
Thus, the state space S for the hallway problem is composed of pairs (l, b), where l is a location
in the grid and b is a boolean variable indicating if the robot is broken or not. The initial state s0
is (d0, false) and the goal is to reach the last door and not be broken, i.e., G = {(dk, false)}. Five
actions are available in the hallway problem: move north, south, east and west, and fix-robot. If
the robot is not broken, the move actions succeed with probability 0.9 and move the robot in
the given direction; and fail with probability 0.1 by not moving the robot. When the robot is
broken, the move actions do not change the current state. If the robot is broken, the fix-robot
action deterministically fixes the robot and moves it to d0, i.e., P(s0|(l, true), fix-robot) = 1 for
all locations l. When the robot is not broken, fix-robot does not change the current state. The
cost of each move action is 1 and the cost of fix-robot is 10.

Figure 5.1: Grid of the hallway problem (Example 5.1) for k = 3 and r = 5. In this problem,
a robot "R" has to navigate between rooms from location d0 to location d3 while avoiding the
hazard locations (grey) and walls (black).

¹ It is unlikely due to the unproven assumption that NC ≠ P.

Since there is a single door connecting each pair of consecutive rooms, we can decompose a k-r-hallway problem into k instances of a 1-r-hallway problem, where the initial state and goal set of the i-th problem are, respectively, (d_{i-1}, false) and {(d_i, false)} for i ∈ {1, ..., k}. Therefore, we can compute an optimal policy for a k-r-hallway problem by combining an optimal policy for each one of its k subproblems.

In this section, we show how to extend Labeled-SSiPP (Algorithm 5.2) in order to exploit the structure of this problem by solving several short-sighted SSPs in parallel and then combining their solutions. We start by assuming that the list of states L that must be visited to reach the goal is given and introduce the new algorithm Parallel Labeled-SSiPP (Section 5.2.1). Then, in Section 5.2.2, we present a method based on landmarks to automatically generate the list of states L.

5.2.1  Algorithm 

Parallel Labeled-SSiPP, shown in Algorithm 5.3, extends Labeled-SSiPP by solving multiple short-sighted SSPs in parallel and combining their solutions. Precisely, Parallel Labeled-SSiPP launches n - 1 new threads (Lines 10 to 12), each one with its own copy of the current lower bound V, while the main thread solves the short-sighted SSP 𝕊_{s,t} associated with the state s (Line 13). Once the ε-optimal solution for 𝕊_{s,t} is obtained by the main thread, all the n - 1 threads are stopped and their current lower bounds V_i are merged with V (Lines 14 to 16). Each thread selects a state s' ∈ L using a thread-safe procedure to avoid duplicates and solves the short-sighted SSP 𝕊_{s',t}; if the ε-optimal solution for 𝕊_{s',t} is obtained before the thread is stopped by the main thread, a new state s' ∈ L is selected (Lines 25 to 28).

 1  Parallel Labeled-SSiPP(SSP 𝕊 = (S, s_0, G, A, P, C), t ∈ N*, H a heuristic for V*, ε > 0, n ∈ N*)
 2  begin
 3      V ← Value function for S with default value given by H
 4      solved ← ∅
 5      while s_0 ∉ solved do
 6          s ← s_0
 7          visited ← Empty-Stack
 8          while s ∉ (G ∪ solved) do
 9              L ← Compute-L(𝕊, s)
10              foreach i ∈ {1, ..., n - 1} do
11                  V_i ← Make-Copy(V)
12                  Start-New-Thread(𝕊, t, V_i, ε, L)
13              (V, π*_{𝕊_{s,t}}) ← Solve-Short-Sighted(𝕊, s, t, V, ε)
14              Stop-All-Threads()
15              for s' ∈ S do in parallel
16                  V(s') ← max{V(s'), V_1(s'), ..., V_{n-1}(s')}
17              while s ∉ G_{s,t} do
18                  visited.Push(s)
19                  s ← Apply-Action(π*_{𝕊_{s,t}}(s), s)
20          while not visited.isEmpty() do
21              s ← visited.Pop()
22              (solved, V) ← Check-Solved(𝕊, s, V, solved, ε)
23              if s ∉ solved then break
24      return V
25  Start-New-Thread(SSP 𝕊, t > 0, V a lower bound for V*, ε > 0, L a list of states)
26  begin
27      while (s ← Thread-Safe(L.pop())) do
28          V ← Solve-Short-Sighted(𝕊, s, t, V, ε)
29  Solve-Short-Sighted(SSP 𝕊, s ∈ S, t > 0, V a lower bound for V*, ε > 0)
30  begin
31      𝕊_{s,t} ← Generate-Short-Sighted-SSP(𝕊, s, V, t)
32      G_{s,t} ← G_{s,t} ∪ (solved ∩ S_{s,t})
33      (π*_{𝕊_{s,t}}, V*_{𝕊_{s,t}}) ← ε-Optimal-SSP-Solver(𝕊_{s,t}, V, ε)
34      foreach s' ∈ S_{s,t} \ G_{s,t} do
35          V(s') ← V*_{𝕊_{s,t}}(s')
36      return (V, π*_{𝕊_{s,t}})

Algorithm 5.3: Parallel version of Labeled-SSiPP (Algorithm 5.2). Stop-All-Threads (Line 14) cancels all the extra threads running and returns immediately to the main thread.

Figure 5.2: Examples of (s, t)-depth-based short-sighted SSPs for the hallway problem in Figure 5.1. The patterned cells represent the locations included in each short-sighted SSP. For both short-sighted SSPs, t = 3, and s equals (d_0, false) and (d_2, false) for (a) and (b) respectively.

A copy of the lower bound V is given to each thread (Algorithm 5.3, Line 10) in order to prevent interference between threads while computing the solutions of the short-sighted SSPs. To illustrate such interference, consider the (s, 3)-depth-based short-sighted SSPs 𝕊_0 and 𝕊_2 associated with the hallway problem example in Figure 5.1 for s equal to, respectively, s_0 = (d_0, false) and (d_2, false). As shown in Figure 5.2, the initial state s_0 belongs to the state space of both 𝕊_0 and 𝕊_2; moreover, s_0 is an artificial goal of 𝕊_2. Thus, if 𝕊_0 and 𝕊_2 are solved in parallel sharing the same lower bound V, then the Bellman updates applied on V(s_0) when solving 𝕊_0 affect the solution of 𝕊_2; therefore there is no guarantee that the solution computed by ε-Optimal-SSP-Solver (Algorithm 5.3 Line 13) for 𝕊_2 is ε-optimal since V(s_0) might have changed.

Another benefit of each thread manipulating its own copy V_i of V is that Theorem 3.7 guarantees that the monotonicity and admissibility of each V_i are preserved. Once the (partial) solutions from all threads are obtained, they are combined in parallel by keeping the maximum over all lower bounds on each state s ∈ S (Algorithm 5.3 Line 15). Clearly, the max operator preserves the admissibility of V and, in Lemma 5.3, we prove that the max operator also preserves the monotonicity of a value function. Therefore, each iteration of Parallel Labeled-SSiPP maintains the lower bound V monotonic and admissible. Corollary 5.4 extends the convergence bound of Labeled-SSiPP (Corollary 5.2) to Parallel Labeled-SSiPP.
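The copy-then-merge scheme described above can be sketched with Python threads. This is a simplified illustration, not the thesis implementation: `solve` is a hypothetical stand-in for Solve-Short-Sighted that improves a private copy in place, the main thread's own solve and the early Stop-All-Threads call are omitted (workers simply run until the list is empty), and value functions are plain dictionaries.

```python
import queue
import threading
from typing import Callable, Dict, Hashable, List

ValueFn = Dict[Hashable, float]


def parallel_round(V: ValueFn, L: List[Hashable],
                   solve: Callable[[Hashable, ValueFn], None],
                   n: int) -> ValueFn:
    """Run n - 1 workers on private copies of V and merge them by pointwise max."""
    pending: "queue.Queue[Hashable]" = queue.Queue()
    for s in L:
        pending.put(s)
    # One private copy per extra thread, so Bellman updates never interfere.
    copies = [dict(V) for _ in range(n - 1)]

    def worker(Vi: ValueFn) -> None:
        while True:
            try:
                s = pending.get_nowait()  # thread-safe selection, no duplicates
            except queue.Empty:
                return
            solve(s, Vi)

    threads = [threading.Thread(target=worker, args=(Vi,)) for Vi in copies]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # Merge step: keep the largest lower bound found for each state.
    return {s: max([V[s]] + [Vi[s] for Vi in copies]) for s in V}
```

With n = 1 no extra thread is started and the function returns V unchanged, which mirrors the observation that the parallel and sequential planners coincide in that case.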

Lemma 5.3. Given an SSP 𝕊 = (S, s_0, G, A, P, C) and two monotonic value functions V_1 and V_2 for 𝕊, then V_m, defined as V_m(s) = max{V_1(s), V_2(s)}, is also a monotonic value function for 𝕊.

Proof. Suppose, for contradiction, that V_m is not monotonic, thus there exists s ∈ S such that V_m(s) > (BV_m)(s). Without loss of generality assume V_2(s) ≤ V_1(s) = V_m(s). If there exist s' ∈ S and a ∈ A s.t. P(s'|s, a) > 0 and V_2(s') > V_1(s'), then either: (i) a equals argmin_{a'} E[C(s, a', s') + V_m(s') | s, a'], and therefore (BV_m)(s) ≥ (BV_1)(s) ≥ V_1(s) = V_m(s); otherwise (ii) (BV_m)(s) = (BV_1)(s) ≥ V_1(s) = V_m(s).

Alternatively, if there exists no such s', then V_2(s') ≤ V_1(s') for all s' ∈ S and a ∈ A s.t. P(s'|s, a) > 0; therefore (iii) (BV_m)(s) = (BV_1)(s) ≥ V_1(s) = V_m(s). By (i)-(iii), we have that V_m(s) ≤ (BV_m)(s), and a contradiction is obtained. □
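Lemma 5.3 can be checked numerically on a toy problem. The sketch below assumes a hypothetical three-state SSP encoded with dictionaries (state 2 is the goal, every action costs 1) and takes monotonicity to mean V(s) ≤ (BV)(s) for all s, as used in the proof; it is a sanity check of one instance, not a proof.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable
ValueFn = Dict[State, float]

# Hypothetical toy SSP: P[s][a] maps successors to probabilities.
P = {
    0: {"a": {1: 1.0}, "b": {2: 0.5, 0: 0.5}},
    1: {"a": {2: 1.0}},
}
C: Dict[Tuple[State, str], float] = {(0, "a"): 1.0, (0, "b"): 1.0, (1, "a"): 1.0}
GOALS: Set[State] = {2}


def bellman(V: ValueFn) -> ValueFn:
    """Apply one Bellman backup to every state (goal states keep value 0)."""
    BV = {}
    for s in V:
        if s in GOALS:
            BV[s] = 0.0
            continue
        BV[s] = min(
            sum(p * (C[s, a] + V[s2]) for s2, p in outcomes.items())
            for a, outcomes in P[s].items()
        )
    return BV


def is_monotonic(V: ValueFn, tol: float = 1e-12) -> bool:
    """Check V(s) <= (BV)(s) for every state."""
    BV = bellman(V)
    return all(V[s] <= BV[s] + tol for s in V)


V1 = {0: 1.0, 1: 0.0, 2: 0.0}
V2 = {0: 0.0, 1: 1.0, 2: 0.0}
Vm = {s: max(V1[s], V2[s]) for s in V1}  # pointwise max, as in Lemma 5.3
```

Here V1 and V2 are both monotonic, and so is their pointwise maximum Vm; by contrast, an overestimate such as {0: 3.0, 1: 0.0, 2: 0.0} fails the check.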
Room Size (r)        |           5            |           10           |           15
Number of Rooms (k)  |   5    10    15    20  |   5    10    15    20  |   5    10    15    20
Parallel L. SSiPP
  n = 2              | 1.85  1.66  1.61  1.58 | 1.46  1.36  1.36  1.39 | 1.29  1.31  1.27  1.32
  n = 4              | 2.73  2.53  2.47  2.42 | 1.97  1.98  1.91  1.93 | 1.71  1.74  1.71  1.74
  n = 8              | 2.95  3.23  3.17  3.01 | 2.05  2.33  2.31  2.31 | 1.85  2.17  2.16  2.11

Table 5.1: Speedup of Parallel Labeled-SSiPP, for different numbers of parallel threads n, w.r.t. Labeled-SSiPP in the hallway robot domain. Results are averaged over 50 random problems for each combination of r and k. The best speedup in every column is obtained with n = 8.


Corollary 5.4. Given an SSP 𝕊 = (S, s_0, G, A, P, C) that satisfies Assumption 2.1, ε > 0, t ∈ N*, n ∈ N* and a monotonic heuristic H for V*, then Parallel Labeled-SSiPP (Algorithm 5.3) reaches ε-convergence after at most ε^{-1} Σ_{s ∈ S} [V*(s) - H(s)] iterations of the loop in Line 5.

Proof. Each iteration of the loop in Line 5 solves at least the short-sighted SSP associated with the current state s, i.e., if n = 1, Parallel Labeled-SSiPP and Labeled-SSiPP are equivalent. Since the max operator preserves the admissibility and monotonicity of V (Lemma 5.3), this proof follows from Corollary 5.2. □

To illustrate the advantages of Parallel Labeled-SSiPP over (sequential) Labeled-SSiPP, we present an experiment comparing both of them on randomly generated k-r-hallway problems. For this experiment, both planners use the Manhattan distance as the heuristic H, LRTDP as the ε-optimal solver and t = 5 for the generation of depth-based short-sighted SSPs. The list L given to Parallel Labeled-SSiPP contains all states in which the robot is not broken and at one of the internal doors, precisely, L = {(d_1, false), (d_2, false), ..., (d_{k-1}, false)}.

We generated 50 random problems for each combination of r ∈ {5, 10, 15} and k ∈ {5, 10, 15, 20}. Door locations are chosen uniformly at random and every location that is not a door is marked as a hazard with probability 0.15. Each planner is run until ε-convergence, for ε = 10^{-4}, and we limit the runtime and memory to 1 hour and 4 GB, respectively. The experiments were conducted on a Linux machine with 8 cores running at 2.40 GHz.

Table 5.1 shows the results averaged over the 50 random problems for each parametrization. Parallel Labeled-SSiPP outperforms its sequential version in all the parametrizations. The obtained speedup varies from 1.85 to 3.23 when 8 threads are used. As expected, we observe diminishing returns: the improvement obtained from each additional thread decreases as more threads are added.
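One way to quantify the diminishing returns is the parallel efficiency, i.e., the speedup divided by the number of threads. This metric is not used in the text and is introduced here only as an illustration; the numbers are the r = 5, k = 5 column of Table 5.1.

```python
from typing import Dict


def parallel_efficiency(speedup_by_threads: Dict[int, float]) -> Dict[int, float]:
    """Return speedup / n for each thread count n (1.0 would be ideal scaling)."""
    return {n: s / n for n, s in speedup_by_threads.items()}


# Speedups for r = 5 and k = 5, taken from Table 5.1.
speedup_r5_k5 = {2: 1.85, 4: 2.73, 8: 2.95}
efficiency = parallel_efficiency(speedup_r5_k5)
```

For this column the efficiency drops from roughly 0.93 with two threads to roughly 0.37 with eight, even though the absolute speedup keeps growing.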


5.2.2  Choosing  States  for  Parallel  Labeled  SSiPP 

In  this  section,  we  present  an  algorithm  to  compute  the  list  of  states  L  used  by  Parallel  Labeled-SSiPP 
to  build  short-sighted  SSPs.  Notice  that  L  can  be  seen  as  a  list  of  subgoals  of  the  original  problem. 
1  Compute-L(SSP 𝕊 = (S, s_0, G, A, P, C), s ∈ S)
2  begin
3      D ← All-Outcomes-Determinization(𝕊)
4      Q ← Find-Landmarks(D, s)
5      P ← Find-Shortest-Path(Q, s, G)
6      L ← Instantiate-All-Formulas(P)
7      return L
8  end

Algorithm 5.4: Landmark approach to compute L for Parallel Labeled-SSiPP (Algorithm 5.3). Q is a graph representing the landmarks of the deterministic problem D. Instantiate-All-Formulas generates all the states s' ∈ S such that at least one landmark in P is true in s'.


Figure 5.3: Example of states returned by Algorithm 5.4 from the initial state for the hallway problem in Figure 5.1. The patterned cells and green arrows represent, respectively, the vertices (landmarks) and arcs (ordering) of the path P (Algorithm 5.4 Line 5).


For instance, the states (d_1, false) and (d_2, false) are subgoals in the hallway problem example (Figure 5.1). Parallel Labeled-SSiPP makes no assumption about the states s ∈ L, and any state s that is reachable from s_0 has the potential to generate a speedup.

In deterministic planning, one approach to obtain subgoals is through landmarks [Hoffman et al., 2004]. A landmark is a formula over the problem's state variables (Section 2.2) that must be true at some point during the execution of every solution that reaches the goal. Two landmarks a and b can also be (partially) ordered according to different constraints, e.g., a is true some time before b, or a is always true one step before b. Finding landmarks and ordering them is computationally expensive; for instance, deciding if a state variable is a landmark is PSPACE-complete [Hoffman et al., 2004]. Therefore, algorithms to automatically find (ordered) landmarks rely on approximations in order to be computationally feasible.

Our approach to generate L for a given 𝕊 is to obtain partially ordered landmarks for the all-outcomes determinization of 𝕊 (Section 2.3.2) and post-process them in order to remove landmarks that have already been met. Algorithm 5.4 describes our method to generate L, and we use the Fast-Downward [Helmert, 2006] landmark identification algorithm as Find-Landmarks in Line 4. Figure 5.3 shows the landmarks selected in Line 4 from the initial state in the hallway example in Figure 5.1.




Room Size (r)        |           5            |           10           |           15
Number of Rooms (k)  |   5    10    15    20  |   5    10    15    20  |   5    10    15    20
Parallel L. SSiPP
  n = 2              | 1.50  1.41  1.43  1.37 | 1.21  1.19  1.17  1.14 | 1.08  1.13  1.09  1.11
  n = 4              | 2.07  1.97  1.91  1.93 | 1.38  1.34  1.29  1.31 | 1.20  1.16  1.13  1.14
  n = 8              | 2.33  2.17  2.13  2.06 | 1.52  1.43  1.44  1.42 | 1.29  1.22  1.21  1.21

Table 5.2: Speedup of Parallel Labeled-SSiPP using Algorithm 5.4 to generate the list of states L in the hallway robot domain. Results are averaged over 50 random problems for each combination of r and k. The best speedup in every column is obtained with n = 8.


We repeated the series of random hallway problem experiments (Table 5.1) following the same methodology and using Algorithm 5.4 to generate the list L. Table 5.2 presents the results as average speedup with respect to (sequential) Labeled-SSiPP over the 50 random problems for each parametrization. Parallel Labeled-SSiPP using Algorithm 5.4 still outperforms its sequential version in all the parametrizations and the speedup varies from 1.21 to 2.33 when 8 threads are used. As expected, the speedup decreases with respect to Parallel Labeled-SSiPP using the list of doors as L, i.e., Table 5.1. There are two reasons for this decrease in performance: the extra overhead of computing the landmarks and the extra states returned by Algorithm 5.4. The latter is illustrated in Figure 5.3: Algorithm 5.4 returns the locations before and after each door location d_i because they are the only locations from which the robot can reach d_i.


5.3  SSiPP-FF 

In this section, we show how to combine SSiPP and determinizations in order to improve the scalability of SSiPP at the price of dropping SSiPP's optimality guarantee. This extension of SSiPP, SSiPP-FF, is depicted in Algorithm 5.5. After reaching an artificial goal s, SSiPP-FF performs the following extra steps with respect to SSiPP (Algorithm 3.2): (i) compute a determinization D of the original SSP; (ii) run FF to solve D using s as initial state; and (iii) execute the returned plan until failure (Lines 12 to 17 in Algorithm 5.5).

Any determinization can be used by SSiPP-FF (Line 13) and if the chosen determinization is stationary, e.g., the all-outcomes and most-likely determinizations, then the deterministic representation of 𝕊 can be pre-computed and reused in every iteration to generate D. Since SSiPP-FF does not assume any specific behavior of FF, any deterministic planner can be used for solving D in Line 14 instead of FF.

 1  SSiPP-FF(SSP 𝕊 = (S, s_0, G, A, P, C), t ∈ N*, H a heuristic for V*, ε > 0)
 2  begin
 3      V ← Value function for S with default value given by H
 4      s ← s_0
 5      while s ∉ G do
 6          𝕊_{s,t} ← Generate-Short-Sighted-SSP(𝕊, s, V, t)
 7          (π*_{𝕊_{s,t}}, V*_{𝕊_{s,t}}) ← ε-Optimal-SSP-Solver(𝕊_{s,t}, V, ε)
 8          foreach s' ∈ S_{s,t} \ G_{s,t} do
 9              V(s') ← V*_{𝕊_{s,t}}(s')
10          while s ∉ G_{s,t} do
11              s ← execute-action(π*_{𝕊_{s,t}}(s))
12          if s ∉ G then
13              D ← Determinize(𝕊)
14              (s_1, a_1, s_2, ..., a_{k-1}, s_k) ← CallFF(D, s)
15              for i ∈ {1, ..., k - 1} do
16                  if s ≠ s_i then break
17                  s ← Apply-Action(a_i, s)
18      return V

Algorithm 5.5: SSiPP-FF: version of SSiPP that incorporates determinizations to obtain a non-optimal solution efficiently.

Besides taking advantage of potential non-optimal solutions, SSiPP-FF also improves the behavior of FF-Replan by not reaching avoidable dead ends in the generated short-sighted SSPs. Formally, suppose that a short-sighted SSP 𝕊_{s,t} generated in Line 6 of Algorithm 5.5 has an avoidable dead end, i.e., there exists at least one proper policy for 𝕊_{s,t} and every proper policy for 𝕊_{s,t} is closed but not complete. Since an ε-optimal policy π*_{𝕊_{s,t}} is computed for 𝕊_{s,t} (Line 7), then π*_{𝕊_{s,t}} is one of the existing proper policies by the definition of optimal policies. Therefore the avoidable dead ends are not reached by executing π*_{𝕊_{s,t}}.

Notice that the guarantee of not reaching avoidable dead ends that are included in the current short-sighted SSP is not due to SSiPP-FF. Instead, this guarantee is inherited from SSiPP. We finish this section by introducing and analyzing the jumping chain problems (Example 5.2), a series of problems in which SSiPP-FF avoids all dead ends while determinization approaches based on shortest distance to goal, e.g., FF-Replan, reach a dead end with a probability that approaches 1 exponentially fast in the problem size.

Example 5.2 (Jumping Chain). For k ∈ N*, the k-th jumping chain problem has 3k + 1 states: S = {s_0, s_1, ..., s_{2k}, r_1, r_2, ..., r_k}. The initial state is s_0 and the goal set is G = {s_{2k}}. Two actions are available, a_w (walk) and a_j (jump), and their costs are, respectively, 1 and 3 independently of the current and resulting state. The walk action is deterministic: P(s_{i+1} | s_i, a_w) = 1 for all i; P(s_{i-1} | r_i, a_w) = 1 for i odd; and P(r_i | r_i, a_w) = 1 for i even. When a_j is applied to s_i, for i even, the resulting state is s_{i+2} with probability 0.75 and r_{i+1} with probability 0.25; if i is odd, then a_j does not change the current state, i.e., P(s_i | s_i, a_j) = 1. For the states r_i, a_j is such that: P(r_i | r_i, a_j) = 1 for even i; and, for odd i, P(s_{i+1} | r_i, a_j) = 0.75 and P(r_{i+1} | r_i, a_j) = 0.25. Notice that, for all i even, r_i is a dead end. Figure 5.4 shows the jumping chain problem for k = 3.

Figure 5.4: Representation of the jumping chain problem (Example 5.2) for k = 3. The initial state is s_0 and the goal set is G = {s_6}. Actions a_w and a_j have cost 1 and 3 respectively.

In the jumping chain problems, FF-Replan using the most-likely outcomes determinization and FF-Replan using the all-outcomes determinization are equivalent because the low probability effect of jump, i.e., moving to a state r_i, is less helpful than its most-likely effect. When in a state r_i, for i odd, FF-Replan never chooses action walk because: (i) walk results in a state further away from the goal; and (ii) jump has a non-zero probability of reaching a state in which the goal is still achievable. Therefore, the solutions obtained by FF-Replan have non-zero probability of reaching a dead end, i.e., a state r_i for i even. Formally, the probability of FF-Replan reaching the goal for the k-th jumping chain problem is (2p - p^2)^k for p = P(s_{i+2} | s_i, a_j).
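The closed form (2p - p^2)^k can be cross-checked by propagating probabilities stage by stage. The sketch below assumes the always-jump behavior described above for FF-Replan (jump in every even s_i and in every odd r_i); the function names are illustrative and not part of the thesis.

```python
def ff_replan_success_probability(k: int, p: float = 0.75) -> float:
    """Probability that the always-jump policy reaches the goal s_{2k}."""
    alive = 1.0  # probability of standing on an even state s_{2j} so far
    for _ in range(k):
        direct = p                 # jump from s_i lands on s_{i+2}
        recovered = (1.0 - p) * p  # jump lands on r_{i+1}, second jump reaches s_{i+2}
        alive *= direct + recovered  # the remaining (1 - p)^2 ends in a dead end
    return alive


def closed_form(k: int, p: float = 0.75) -> float:
    """The expression (2p - p^2)^k stated in the text."""
    return (2.0 * p - p * p) ** k
```

For p = 0.75 each pair of states is crossed with probability 0.9375, so for k = 3 the goal is reached with probability 0.9375^3 ≈ 0.824, and this value decays geometrically as k grows.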

Alternatively, SSiPP-FF always reaches the goal for t ∈ N* and the following trivial extension of the zero-heuristic: h_d(s) = ∞ if P(s | s, a) = 1 for all a ∈ A and h_d(s) = 0 otherwise. Formally, a dead end r_i (for i even) can only be reached when a_j is applied in r_{i-1}. Hence, in order to show that SSiPP-FF never reaches a dead end, we need to show that, for every odd i: (i) π*_{𝕊_{s,t}} generated in Line 7 never applies a_j on r_i; and (ii) if r_i ∈ G_{s,t}, then π*_{𝕊_{s,t}} does not reach r_i, since the determinization part of SSiPP-FF (Line 14) would apply a_j. The former case is true since π*_{𝕊_{s,t}} is the ε-optimal solution and h_d(s_{i-1}) = 0 < h_d(r_{i+1}) = ∞, therefore π*_{𝕊_{s,t}}(r_i) = a_w. In the latter case, if r_i ∈ G_{s,t}, then {s_i, s_{i+1}} ⊆ G_{s,t}. Since h_d(r_i) = h_d(s_i) = h_d(s_{i+1}) = 0 and C(s_{i-1}, a_w, s_i) = 1 < C(s_{i-1}, a_j, ·) = 3, then π*_{𝕊_{s,t}}(s_{i-1}) = a_w and the value of s in Line 14 of SSiPP-FF is s_i. Therefore, SSiPP-FF using h_d always reaches the goal for t ∈ N*. Note that SSiPP-FF can obtain a speedup over SSiPP in the jumping chain problems if the determinization solution can be efficiently obtained.
5.4 Summary

In this chapter, we presented three extensions of SSiPP: Labeled-SSiPP, Parallel Labeled-SSiPP and SSiPP-FF. Labeled-SSiPP improves the convergence time of SSiPP to the ε-optimal solution by labeling states that have already ε-converged as solved. Solved states are not revisited during the search for the ε-optimal solution and are also pruned from the short-sighted SSPs since an ε-optimal solution from these labeled states is already known. Parallel Labeled-SSiPP extends Labeled-SSiPP by generating and solving multiple short-sighted SSPs in parallel. For both Labeled-SSiPP and Parallel Labeled-SSiPP, we proved an upper bound on the number of iterations necessary for them to converge to the ε-optimal solution.

We also introduced SSiPP-FF, a planner that combines SSiPP with determinizations in order to compute sub-optimal solutions more efficiently. Besides improving the scalability of SSiPP, we showed how SSiPP-FF can make FF-Replan safer by avoiding dead ends within the solved short-sighted SSPs.

In the next chapter, we present the previous work in optimal and suboptimal probabilistic planning and how it relates to SSiPP, Labeled-SSiPP, Parallel Labeled-SSiPP and SSiPP-FF. Then, in Chapter 7, we empirically compare our algorithms against the state-of-the-art probabilistic planners.


Chapter  6 
Related  Work 


This  chapter  presents  a  review  of  related  work  in  probabilistic  planning.  Probabilistic  planners, 
i.e.,  algorithms  that  return  closed  policies,  are  reviewed  in  Sections  6.1  to  6.3;  and  replanners, 
algorithms  that  return  partial  policies,  are  reviewed  in  Section  6.4.  Section  6.5  presents  how  this 
thesis  fits  with  respect  to  the  presented  related  work. 


6.1  Extensions  of  Value  Iteration 

One direct extension of Value Iteration (VI), presented in Section 2.1, is Topological Value Iteration (TVI) [Dai and Goldsmith, 2007]. TVI pre-processes the given SSP by performing a topological analysis of the state space S. The result of this analysis is the set of strongly connected components (SCCs), and TVI solves the SSP by applying VI on each SCC in reverse topological order, i.e., from the goals to the initial state. This decomposition can speed up the search for ε-optimal solutions when the original SSP can be decomposed into several close-to-equal-size SCCs. In the worst case, when the SSP has just one SCC, TVI performs worse than VI due to the overhead imposed by the topological analysis.
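The ordering step of TVI can be illustrated with Tarjan's SCC algorithm, which happens to emit components with sink components first, i.e., in reverse topological order of the component graph. The text does not say which SCC algorithm TVI uses, so this choice is an assumption; the graph encoding (a dictionary from each state to its possible successors) and the recursive formulation, which is only suitable for small graphs, are also illustrative.

```python
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def tarjan_sccs(graph: Graph) -> List[List[Hashable]]:
    """Return the SCCs of `graph`, sink components first."""
    index: Dict[Hashable, int] = {}
    low: Dict[Hashable, int] = {}
    stack: List[Hashable] = []
    on_stack = set()
    sccs: List[List[Hashable]] = []
    counter = [0]

    def visit(v: Hashable) -> None:
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical state graph: 4 plays the role of a goal, 0 of the initial state.
example = {0: [1], 1: [0, 2], 2: [3], 3: [2, 4], 4: []}
order = tarjan_sccs(example)
```

In this example the component containing the goal-like state 4 comes first and the one containing state 0 comes last, which is the order in which a TVI-style solver would run VI on each component.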

To increase the chances that a problem will be decomposed into several close-to-equal-size SCCs, Focused Topological Value Iteration (FTVI) [Dai et al., 2009] was introduced. FTVI performs a best-first forward search in which a lower bound V for V* is iteratively improved and actions that are provably sub-optimal are removed from the original SSP. Once R(S, V) is small, the search is stopped and the resulting SSP is solved using TVI and V as lower bound. Since the removed actions are always sub-optimal, FTVI returns an ε-optimal solution. In the worst case, FTVI is equivalent to TVI since there is no guarantee that any action will be removed from the original SSP.
6.2 Real Time Dynamic Programming and Extensions


Another extension of VI is Real Time Dynamic Programming (RTDP) [Barto et al., 1995], presented in Section 2.3.1. RTDP extends the asynchronous version of VI by using greedy search and sampling to find the next state on which to perform a Bellman backup. In order to avoid being trapped in loops and to find an ε-optimal solution, RTDP updates its lower bound V(s) of V*(s) on every state s visited during the search. If Assumption 2.1 holds for the given SSP, then RTDP always finds an ε-optimal solution after several search iterations (possibly infinitely many), i.e., RTDP is asymptotically optimal. Unlike VI, TVI and FTVI, which compute complete policies, RTDP returns a closed policy if ε is small enough or a partial policy otherwise.
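A single RTDP trial, as described above, can be sketched as follows. The dictionary encoding of the SSP, the function names and the `max_steps` safeguard are assumptions made for this illustration; termination detection and labeling (as in LRTDP) are deliberately left out.

```python
import random
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Transitions = Dict[State, Dict[str, Dict[State, float]]]
Costs = Dict[Tuple[State, str], float]


def q_value(V: Dict[State, float], s: State, a: str,
            P: Transitions, C: Costs) -> float:
    """Expected cost of applying a in s and then following V."""
    return sum(p * (C[s, a] + V[s2]) for s2, p in P[s][a].items())


def rtdp_trial(V: Dict[State, float], s0: State, goals: Set[State],
               P: Transitions, C: Costs, rng: random.Random,
               max_steps: int = 1000) -> Dict[State, float]:
    """Run one greedy trial from s0, backing up every visited state in place."""
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        # Greedy action with respect to the current lower bound.
        a = min(P[s], key=lambda act: q_value(V, s, act, P, C))
        V[s] = q_value(V, s, a, P, C)  # Bellman backup on the visited state
        successors, probs = zip(*P[s][a].items())
        s = rng.choices(successors, weights=probs)[0]  # sample the outcome
    return V


# Hypothetical toy SSP: state 2 is the goal and every action costs 1.
P = {0: {"a": {1: 1.0}, "b": {2: 0.5, 0: 0.5}}, 1: {"a": {2: 1.0}}}
C = {(0, "a"): 1.0, (0, "b"): 1.0, (1, "a"): 1.0}
V = {0: 0.0, 1: 0.0, 2: 0.0}
rng = random.Random(0)
for _ in range(200):
    rtdp_trial(V, 0, {2}, P, C, rng)
```

Starting from the zero heuristic, the repeated trials raise V(0) and V(1) towards the optimal values 2 and 1 of this toy problem without ever exceeding them.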

Several extensions of RTDP have been proposed and the first one is Labeled RTDP (LRTDP) [Bonet and Geffner, 2003]. LRTDP introduces a labeling mechanism to find states that have already ε-converged and avoids exploring these converged states again. With this technique, LRTDP provides an upper bound on the number of iterations necessary to find an ε-optimal solution.

The following three algorithms also extend RTDP by maintaining, in addition to the lower bound V, an upper bound V̄ on V* and providing different methods to direct the exploration of the state space: Bounded RTDP (BRTDP) [McMahan et al., 2005], Focused RTDP (FRTDP) [Smith and Simmons, 2006] and Value of Perfect Information RTDP (VPI-RTDP) [Sanner et al., 2009]. The advantage of keeping an upper bound is that the exploration of the state space can be biased towards states s in which the uncertainty about V*(s) is large, e.g., the gap between V̄(s) and V(s) is large.

This improved criterion to guide the search decreases the number of Bellman backups required to find an ε-optimal solution; however, each iteration of the search is considerably more expensive due to the maintenance of the upper bound V̄. Although no clear dominance exists between RTDP and its extensions, empirically it has been shown that in most of the problems: (i) RTDP is outperformed by all its extensions; and (ii) VPI-RTDP outperforms BRTDP and FRTDP.

The extensions of RTDP mentioned so far are concerned with improving the convergence of RTDP to the ε-optimal solution, whereas ReTrASE [Kolobov et al., 2009] extends RTDP in order to improve its scalability. ReTrASE achieves this by projecting V into a lower dimensional space. The set of basis functions used by ReTrASE is obtained by solving the all-outcomes determinization of the original problem (Section 2.3.2). Due to the lower dimensional representation, ReTrASE is non-optimal.
6.3 Policy Iteration and Extensions

A different approach for finding ε-optimal solutions is Policy Iteration (PI) [Howard, 1960]. PI performs search in the policy space and iteratively improves the current policy until no further improvement is possible, i.e., an optimal policy is found. Since PI was originally designed for infinite-horizon MDPs, it returns a complete policy; therefore, when applied to SSPs, PI does not take advantage of the initial state s_0 to prune its search.

LAO* [Hansen and Zilberstein, 2001] can be seen as a version of PI that takes advantage of s_0 and computes ε-optimal closed policies that are potentially not complete. Precisely, LAO* computes a closed ε-optimal policy for the sequence S_0 ⊆ S_1 ⊆ ... ⊆ S_k ⊆ S, where S_0 = {s_0}, i.e., S_0 contains only the initial state, S is the complete state space of the SSP and S_i is generated by greedily expanding S_{i-1}. LAO* stops when S^π ⊆ S_i; therefore the closed ε-optimal policy for S^π is also ε-optimal for the original problem. Improved LAO* (ILAO*) [Hansen and Zilberstein, 2001] enhances the performance of LAO* by increasing how many states are added to S_{i-1} to generate S_i and by performing single Bellman backups in a depth-first postorder traversal of S_i instead of using PI or VI to compute ε-optimal solutions for S_i.


6.4  Replanners 

Another direction to solve probabilistic planning problems is replanning. One of the simplest, yet powerful, replanners is FF-Replan [Yoon et al., 2007], presented in Section 2.3.2. Given a state s (initially s equals s_0), FF-Replan generates the all-outcomes determinization D of the SSP 𝕊 being solved and uses the deterministic planner FF [Hoffmann and Nebel, 2001] to solve D from state s. The solution π for D is then applied to 𝕊; if and when the execution of π fails in the probabilistic environment, FF is re-invoked to plan again from the failed state. FF-Replan was the winner of the first International Probabilistic Planning Competition (IPPC) [Younes et al., 2005], in which it outperformed the probabilistic planners due to their poor scalability. Despite its major success, FF-Replan is non-optimal and oblivious to probabilities and dead ends, leading to poor performance in probabilistically interesting problems [Little and Thiebaux, 2007], e.g., the triangle tire domain (Section 4.1.2).

FF-Hindsight [Yoon et al., 2008] is a non-optimal replanner that generalizes FF-Replan based on hindsight optimization. Given a state s, FF-Hindsight performs the following three steps: (i) randomly generate a set of non-stationary deterministic problems D starting from s; (ii) use FF to solve each problem in D; and (iii) combine the cost of their solutions to estimate the true cost of reaching a goal state from s. Each deterministic problem in D has a fixed horizon and is generated by sampling one outcome of each probabilistic action for each time step. This process reveals two major drawbacks of FF-Hindsight: (i) a bound on the horizon size of the problem is needed in order to produce the relaxed problems; and (ii) rare effects of actions might be ignored by the sampling procedure. While the first drawback is intrinsic to the algorithm, a workaround for the second one has been proposed [Yoon et al., 2010]: always add the all-outcomes determinization of the problem to D, thereby ensuring that every effect of an action appears in at least one deterministic problem in D.

Another determinization-based replanner is HMDPP [Keyder and Geffner, 2008]. Instead of
using the all-outcomes or the most-likely outcome determinization, HMDPP uses the self-loop
determinization, a determinization approach that implicitly encodes the probabilities of the
action outcomes in their costs. Formally, given an SSP 𝕊 = ⟨S, s0, G, A, P, C⟩, the self-loop
determinization of 𝕊 is the problem D = ⟨S, s0, G, A', C'⟩ in which, for all s ∈ S, a ∈ A and
s' ∈ S such that P(s'|s, a) > 0, A' contains the action a' that deterministically transforms s
into s' and its cost C'(s, a', s') is C(s, a, s')/P(s'|s, a). Therefore, solutions for D that use
low-probability effects of actions are penalized. HMDPP also pre-processes the original SSP 𝕊
using pattern databases [Haslum et al., 2007] for a fixed amount of time in order to obtain a set
of partial policies π_db from some states in S to the goal. These two techniques are combined in
HMDPP as follows: at a state s, if there is a pre-computed policy π_db from s to the goal, then
π_db(s) is applied; otherwise, a solution π_det for the self-loop determinization of 𝕊 is
computed from s and executed until a state s' in which π_det is not defined is reached. This
process is repeated until a goal state is reached.
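
As a minimal sketch of the self-loop determinization defined above, the following Python
function creates one deterministic action per probabilistic outcome and divides the original
cost by the outcome probability. The dictionary-based encoding of P and C and the name
self_loop_determinization are assumptions made for illustration; this is not HMDPP's
implementation.

```python
from typing import Dict, Tuple

State = str
Action = str
# P[(s, a)] maps each successor s' to P(s'|s, a); C[(s, a, s')] is the transition cost.
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
Costs = Dict[Tuple[State, Action, State], float]


def self_loop_determinization(P: Transitions, C: Costs) -> Costs:
    """Return the costs C'(s, a', s') of the deterministic actions a'."""
    det_costs: Costs = {}
    for (s, a), outcomes in P.items():
        for s_next, prob in outcomes.items():
            if prob > 0:
                # One deterministic action per outcome of a; the name is hypothetical.
                det_action = f"{a}->{s_next}"
                det_costs[(s, det_action, s_next)] = C[(s, a, s_next)] / prob
    return det_costs


# Toy example: "move" reaches s1 with probability 0.8 and stays in s0 otherwise.
P = {("s0", "move"): {"s1": 0.8, "s0": 0.2}}
C = {("s0", "move", "s1"): 1.0, ("s0", "move", "s0"): 1.0}
det = self_loop_determinization(P, C)
```

In the toy example the likely outcome costs 1/0.8 = 1.25 while the unlikely outcome costs
1/0.2 = 5, so a deterministic planner minimizing cost is steered away from plans that rely on
rare effects.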

Based on solution refinement, two other non-optimal replanners were proposed: Envelope
Propagation (EP) [Dean et al., 1995] and Robust FF (RFF) [Teichteil-Koenigsbuch et al., 2008].
In general terms, EP and RFF compute an initial partial policy π and iteratively expand it in
order to avoid replanning. EP performs state aggregation by selecting a set of states Ŝ ⊆ S and
replacing them by a meta state out. The set Ŝ is obtained by finding states that have a low
probability of being reached and also have an expected cost larger than that of the current
state. The obtained state space S' equals {out} ∪ (S \ Ŝ), and special actions are also added to
the aggregated SSP to represent transitions between S' and the meta state out. At each iteration,
EP refines its approximation S' of S by selecting states s ∈ Ŝ and adding them to S'. After S' is
expanded, a new round of aggregation is performed in order to avoid the convergence of S' to S.
If a state s ∈ Ŝ needs to be avoided, e.g., high-cost states and dead ends, then EP is unable to
take that signal into account to effectively avoid it.

RFF, the winner of the third IPPC [Bryce and Buffet, 2008], uses a different approach for
solution refinement: an initial partial policy π is computed by solving the most-likely outcome
determinization of the original problem using FF and then the robustness of π is iteratively
improved. For RFF, robustness is defined in terms of the probability of replanning, i.e., given
p ∈ [0, 1], RFF computes π such that the probability of replanning when following π from s0 is at
most p. Since computing the probability of replanning when following π is costly, RFF
approximates it by performing Monte-Carlo simulations.
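
The Monte-Carlo approximation mentioned above can be illustrated as follows. This is a sketch
under assumptions, not RFF's code: the partial policy is a dictionary, sample_next is a
hypothetical simulator of the environment, and a run counts as a replanning event when it
reaches a non-goal state for which the policy is undefined.

```python
import random


def estimate_replanning_probability(policy, s0, goals, sample_next,
                                    runs=1000, max_steps=100, seed=0):
    """Estimate the probability that following `policy` from s0 requires replanning."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        s = s0
        for _ in range(max_steps):
            if s in goals:
                break
            if s not in policy:  # the partial policy is undefined here
                failures += 1
                break
            s = sample_next(s, policy[s], rng)
        # Runs that exhaust max_steps inside the policy are not counted as failures.
    return failures / runs


def sample_next(s, a, rng):
    """Toy environment: every action reaches the goal g w.p. 0.9, else state x."""
    return "g" if rng.random() < 0.9 else "x"


partial_policy = {"a": "go"}             # undefined for state x
closed_policy = {"a": "go", "x": "go"}   # defined for every reachable state
estimate = estimate_replanning_probability(partial_policy, "a", {"g"}, sample_next)
closed_estimate = estimate_replanning_probability(closed_policy, "a", {"g"}, sample_next)
```

For the partial policy the estimate approaches the true replanning probability 0.1 as the
number of runs grows, while the closed policy never triggers replanning.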

A direction orthogonal to all the approaches mentioned so far is taken by t-look-ahead
[Pearl, 1985, Russell and Norvig, 2003] and Upper Confidence bound for Trees (UCT)
[Kocsis and Szepesvári, 2006]. The approach employed by these algorithms is to relax SSPs into
finite-horizon MDPs with goals, i.e., to change the horizon of the SSP from indefinite to finite.
t-look-ahead fixes the horizon of the relaxed problem to t time steps and solves it using dynamic
programming (Chapter 2).
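
A small recursive sketch of t-look-ahead is given below. It assumes that the SSP is accessed
through the hypothetical callbacks actions, outcomes and cost, and that states at the end of
the horizon are evaluated by a heuristic (zero in the example); a practical implementation
would memoize the pairs (s, t) as dynamic programming does.

```python
def t_look_ahead(s, t, goals, actions, outcomes, cost, heuristic):
    """Return (value, best_action) for state s using t steps of look-ahead."""
    if s in goals:
        return 0.0, None
    if t == 0:
        return heuristic(s), None  # horizon reached: fall back to the heuristic
    best_value, best_action = float("inf"), None
    for a in actions(s):
        q = 0.0
        for s_next, prob in outcomes(s, a):
            v_next, _ = t_look_ahead(s_next, t - 1, goals, actions,
                                     outcomes, cost, heuristic)
            q += prob * (cost(s, a, s_next) + v_next)
        if q < best_value:
            best_value, best_action = q, a
    return best_value, best_action


# Toy chain: "go" advances one state w.p. 0.5 and stays put otherwise; unit costs.
goals = {3}
actions = lambda s: ["go"]
outcomes = lambda s, a: [(s + 1, 0.5), (s, 0.5)]
cost = lambda s, a, s_next: 1.0
heuristic = lambda s: 0.0
value, action = t_look_ahead(2, 2, goals, actions, outcomes, cost, heuristic)
```

With a horizon of 2 the value of state 2 is 1.5, which underestimates the infinite-horizon
expected cost of 2 because the truncated problem cannot account for longer runs of failures.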

UCT is an approximation of t-look-ahead obtained by using sparse sampling techniques.
Formally, UCT iteratively builds a policy tree by expanding the best node according to a biased
version of the Bellman equations (Equation (2.2), p. 12) to ensure that promising actions are
sampled more often. Notice that UCT, like t-look-ahead, builds a policy tree, i.e., a policy free
of loops, since the horizon of the problem is relaxed from indefinite to finite of size t. While
UCT does not require the search depth parameter t, it is governed by two other parameters: w, the
number of samples per decision step, and c, the weight of the bias term for choosing actions. UCT
is the basis of PROST [Keller and Eyerich, 2012], the winner of IPPC 2011 [Coles et al., 2012].

In the context of motion planning in dynamic environments, another relevant approach is
Variable Level-of-Detail (VLOD) [Zickler and Veloso, 2010] planning and execution. VLOD
computes a collision-free trajectory from an initial state to the goal by ignoring the physical
interactions with poorly predictable dynamic objects in the far future. Formally, VLOD computes
a plan in which: (i) all actions applicable from the initial state (time t = 0) until a given time
threshold consider the original model of the world; and (ii) all actions applicable after this
threshold consider a relaxed model M of the world. This relaxed model M simplifies the problem
by ignoring the collisions between the agent and the other dynamic objects in the environment,
e.g., the agent is able to pass through other moving agents and objects in M. Therefore, VLOD
can efficiently compute a plan that locally avoids collisions while still taking into
consideration the goal set in order to be robust against local minima.

For planning with incomplete information, a relevant approach is assumptive planning and
execution [Nourbakhsh and Genesereth, 1996]. In this approach, the uncertainty of execution
is decreased by making simplifying assumptions. For instance, if the initial state s0 is
partially defined, then one possible simplifying assumption is to instantiate some of the
undefined state variables. Planning and execution are then interleaved through a replanning
loop: (i) given the current set of possible states b, a smaller set b' is obtained by making
additional assumptions about b; (ii) a conditional plan C from b' to the goal G is computed;
(iii) C is executed in the environment; and (iv) both b and b' are updated according to the
actions applied in the environment. If and when b' becomes inconsistent, replanning is applied
using the new current incomplete state b. The authors also provide sufficient conditions over
the simplifying assumptions to guarantee that this replanning approach is sound and complete.

6.5  How  our  Work  Fits 

Table 6.1 summarizes the related work and provides an overview of how this thesis fits with
respect to it.

This  thesis  presents  a  novel  relaxation  technique  for  probabilistic  planning,  the  short-sighted 
SSPs.  Short-sighted  SSPs  relax  probabilistic  planning  problems  by  pruning  the  state  space  and 
adding  artificial  goals  to  heuristically  estimate  the  cost  of  reaching  an  original  goal  from  the 
pruned  states.  The  usage  of  artificial  goals  is  the  key  difference  between  short-sighted  SSPs  and 
the  state  space  aggregation  performed  by  EP.  Since  a  heuristic  cost  is  incurred  when  an  artificial 
goal  is  reached,  the  solutions  of  short-sighted  SSPs  can  be  effectively  biased  towards  the  original 
goals and away from high-cost areas of the state space.

Short-sighted  SSPs  also  differ  from  determinizations  because  they  do  not  change  the  action 
structure.  Therefore  all  effects  of  actions  are  considered  and  their  probabilities  are  not  ignored. 
Similarly,  short-sighted  SSPs  differ  from  VLOD  and  assumptive  planning  and  execution  because 
they  neither  simplify  the  model  in  the  far  future  nor  make  additional  assumptions  to  reduce 
uncertainty.  Instead,  short-sighted  SSPs  use  a  heuristic  to  estimate  the  cost  of  reaching  the  goal 
from the artificial goals and preserve the original action structure. These features allow SSiPP
to iteratively improve the given heuristic until it ε-converges to the optimal solution of the SSP
being solved. Notice that the determinization approaches, VLOD, and assumptive planning are
not able to compute ε-optimal solutions of SSPs.

Depth-based short-sighted SSPs, one formulation of short-sighted SSPs, also present a novel
property with respect to the previous work: closed policies for (s, t)-depth-based short-sighted
SSPs can be applied to the original SSP for at least t steps without replanning. The replanners
reviewed  in  Section  6.4  do  not  guarantee  how  many  steps  their  partial  policies  can  be  applied 
before  replanning  is  needed. 
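
To make the construction concrete, the sketch below builds the state space of a depth-based
short-sighted problem with a breadth-first search. It assumes, for illustration only, that
depth is the minimum number of actions needed to reach a state from s, that states at depth
exactly t become artificial goals (whose heuristic value would be charged on arrival), and
that successors is a hypothetical function returning every state reachable in one action.

```python
from collections import deque


def depth_based_short_sighted(s, t, goals, successors):
    """Return (states, artificial_goals) of the (s, t)-depth-based relaxation."""
    depth = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if depth[u] == t or u in goals:
            continue  # neither the frontier nor original goals are expanded
        for v in successors(u):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    states = set(depth)
    # Non-goal states on the frontier are the artificial goals.
    artificial_goals = {u for u, d in depth.items() if d == t and u not in goals}
    return states, artificial_goals


# Toy chain 0 -> 1 -> 2 -> 3 -> 4 in which every action may also leave the state unchanged.
successors = lambda u: [u, min(u + 1, 4)]
states, artificial = depth_based_short_sighted(0, 2, {4}, successors)
```

Any trajectory starting at s needs at least t actions to reach an artificial goal, which is
the intuition behind the t-step guarantee stated above.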
type of policy computed; if the planner is able to use state space heuristics H(s); the simplification applied to manage the uncertainty
structure of the problems; and the overall approach employed by the planner.

Chapter 7

Empirical  Evaluation 


In this chapter, we present a rich empirical comparison between the proposed algorithms and
state-of-the-art probabilistic planners and replanners. We begin by reviewing the domains and
problems used in the experiments. Next, in Section 7.2, we present a series of experiments
to evaluate the convergence time to the ε-optimal solution of SSiPP, Labeled-SSiPP, and other
optimal planners. In Section 7.3, we simulate an International Probabilistic Planning
Competition (IPPC) [Younes et al., 2005, Bonet and Givan, 2007, Bryce and Buffet, 2008] using
SSiPP, Labeled-SSiPP, SSiPP-FF, previous IPPC winners and other state-of-the-art planners as
contestants.


7.1  Domains  and  Problems 

In this section, we present the four domains from IPPC’08 [Bryce and Buffet, 2008] which we use
in our experiments.1 The first two domains, probabilistic blocks world (Section 7.1.1) and zeno
travel (Section 7.1.2), are probabilistic extensions of their deterministic counterparts. Triangle
tire world (Section 7.1.3) and exploding blocks world (Section 7.1.4) are probabilistically
interesting problems [Little and Thiebaux, 2007], i.e., problems in which approaches that
oversimplify the probabilistic structure of the actions perform poorly.

7.1.1  Probabilistic  Blocks  World 

The probabilistic blocks world is an extension of the well-known blocks world in which the
actions pick-up and put-on-block can fail with probability 1/4. If and when these actions fail,
the target block is dropped on the table; for instance, pick-up A from B results in block A being
on the table with probability 1/4. The action pick-up-from-table also fails with probability
1/4, in which case nothing happens, i.e., the target block remains on the table. Lastly, the action
put-down deterministically puts the block being held on the table.

1 All problems from IPPC’08 are available at http://ippc-2008.loria.fr/wiki/index.php/Results.html

Problem #                     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Number of blocks              5   5   5   5  10  10  10  10  14  14  14  14  18  18  18
Cost of pick-up               1   2   2   3   1   2   2   4   1   2   2   4   1   2   4
Cost of pick-up-from-table    1   2   3   2   1   2   3   3   1   2   3   3   1   2   3

Table 7.1: Number of blocks and the cost of actions pick-up and pick-up-from-table for each
of the 15 problems considered from the probabilistic blocks world.

This probabilistic version of blocks world also contains three new actions that allow towers
of two blocks to be manipulated: pick-tower, put-tower-on-block and put-tower-down.
While action put-tower-down deterministically puts the tower still assembled on the table, the
other two actions are probabilistic and fail with probability 9/10. The current state is not changed
when pick-tower fails, and put-tower-on-block fails by dropping the tower on the table (the
dropped tower remains built).

Since every action in the probabilistic blocks world is reversible, the goal is always reachable
from any state; therefore Assumption 2.1 holds for all problems in this domain. The actions
put-on-block, put-down, pick-tower, put-tower-on-block and put-tower-down have
cost 1. In order to explore the trade-offs between: (i) putting a block on top of other blocks versus
putting a block on the table; and (ii) picking up a single block versus a tower of blocks, the cost
of the pick-up and pick-up-from-table actions is different for each problem. Table 7.1 shows
the total number of blocks and the cost of both pick-up and pick-up-from-table actions for
the 15 problems considered. In all the considered problems, the goal statement contains all the
blocks. For the remainder of this chapter, we refer to the probabilistic blocks world as blocks
world.

7.1.2  Zeno  Travel 

The  zeno  travel  domain  is  a  logistic  domain  in  which  a  given  number  of  people  need  to  be 
transported  from  their  initial  locations  to  their  destinations  using  a  fleet  of  airplanes.  Moreover, 
the  level  of  fuel  of  each  airplane  is  also  modeled  and  therefore  there  is  a  need  to  plan  to  refuel. 

The available actions in this domain are: boarding, debarking, refueling, flying (at regular
speed) and zooming (flying at a faster speed). Each action has a random duration modeled by
a geometrically distributed random variable with probability p; the expected duration of each
action, i.e., the number of time steps necessary to succeed, is 1/p. In order to ensure the
geometric duration of the available actions, they are represented by a two-step procedure, e.g.,
start-boarding and finish-boarding, in which the first step is always deterministic and the
second step succeeds with probability p. The value of p is 1/2, 1/4, 1/7, 1/25 and 1/15 for
boarding, debarking, refueling, flying and zooming, respectively.

Problem #    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Cities       4   5   5   6   6   7   7   8   9  10  11  13  14  15  20
Persons      2   2   5   2   5  10   5   5  10   5  10   5  10  10  10
Airplanes    2   2   3   2   3   6   3   3   6   3   6   3   6   6   6

Table 7.2: Number of cities, persons and airplanes for each of the 15 problems considered of the
zeno travel domain.

The cost of all actions is 1 except for actions flying and zooming that have costs 10 and 25,
respectively. Although the fuel requirement for flying and zooming is the same, their expected
costs differ due to their different costs and success probabilities: 250 for flying (10 × 25) and
375 for zooming (25 × 15).
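
The two expected costs follow from the mean 1/p of the geometric distribution. The short check
below reproduces them with exact rational arithmetic, assuming that the action cost is paid on
every attempt of the probabilistic step; the helper name expected_total_cost is illustrative.

```python
from fractions import Fraction


def expected_total_cost(cost_per_attempt, success_probability):
    """Expected cost until success: cost per attempt times the 1/p expected attempts."""
    return cost_per_attempt / success_probability


flying = expected_total_cost(10, Fraction(1, 25))   # cost 10, p = 1/25
zooming = expected_total_cost(25, Fraction(1, 15))  # cost 25, p = 1/15
```

Although zooming needs fewer attempts on average (15 versus 25), its higher cost per attempt
makes it the more expensive action in expectation.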

As in the blocks world domain, Assumption 2.1 holds for all problems in the zeno travel
domain. Table 7.2 shows the number of persons, cities and airplanes for each of the 15 problems
considered. In all the considered problems, the fuel level of each airplane is discretized into 5
categories: empty, 1/4, 1/2, 3/4 and full.

7.1.3  Triangle  Tire  World 

The  triangle  tire  world,  described  in  Section  4.1.2,  is  a  probabilistically  interesting  domain  with 
avoidable  dead  ends.  In  the  experiments,  the  problem  number  corresponds  to  the  parameter  n  of 
the  triangle  tire  world  problem. 

7.1.4  Exploding  Blocks  World 

The  exploding  blocks  world  is  a  probabilistic  extension  of  the  deterministic  blocks  world  in 
which  blocks  can  explode  and  destroy  other  blocks  or  the  table.  Once  a  block  or  the  table  is 
destroyed,  nothing  can  be  placed  on  them  and  destroyed  blocks  cannot  be  moved.  Therefore,  it 
is  possible  to  reach  dead  ends  in  the  exploding  blocks  world.  Moreover,  not  all  problems  in  the 
exploding  blocks  world  domain  have  a  proper  policy,  i.e.,  these  problems  might  have  unavoidable 
dead  ends. 

All actions available in the exploding blocks world, pick-up, pick-up-from-table, put-down
and put-on-block, have the same effects as their counterparts in the deterministic blocks world.
Pick-up and pick-up-from-table have the extra precondition that the block being picked up
is not destroyed. Actions put-down and put-on-block have the probabilistic side-effect of
detonating the block being held and destroying the table or the block below with probability 2/5
and 1/10, respectively. Once a block is detonated, it can be safely moved, i.e., a detonated block
cannot destroy other blocks or the table.

Problem #             1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Number of blocks      5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
Blocks in the goal    2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

Table 7.3: Number of blocks and blocks in the goal statement for each of the 15 problems
considered from the exploding blocks world.

The  IPPC’08  encoding  of  the  exploding  blocks  world  has  a  flaw  in  which  a  block  can  be 
placed  on  top  of  itself  [Little  and  Thiebaux,  2007].  This  flaw  allows  planners  to  safely  discard 
blocks  not  needed  in  the  goal  because,  after  placing  a  block  B  on  top  of  itself:  (i)  no  block  is 
being  held,  i.e.,  the  planner  is  free  to  pick  up  another  block;  and  (ii)  only  B  might  be  destroyed, 
thus  preserving  the  other  blocks  and  the  table.  We  consider  the  fixed  version  of  the  IPPC’08 
exploding  blocks  world,  in  which  the  action  put-on-block  has  the  additional  precondition  that 
the  destination  block  is  not  the  same  as  the  block  being  held;  precisely,  we  added  the  precondition 
(not (= ?b1 ?b2)) to put-on-block (?b1 ?b2).

Table 7.3 shows the total number of blocks and blocks in the goal statement for the 15
exploding blocks world problems considered. In the considered problems, all actions have cost 1.

7.2  Convergence  to  the  Optimal  Solution 

In the following experiments, we compare the time necessary for LRTDP [Bonet and Geffner,
2003], Focused Topological Value Iteration (FTVI) [Dai et al., 2009], SSiPP and Labeled-SSiPP
to ε-converge to the optimal solution. SSiPP-FF is not considered since it is not guaranteed to
converge to an ε-optimal solution. For the experiments in Section 7.2.1, we use the domains
from IPPC’08 (reviewed in Section 7.1) and, in Section 7.2.2, we use the race-track domain, a
common domain to compare optimal probabilistic planners.

7.2.1  Problems  from  the  International  Probabilistic  Planning  Competition 

In this experiment, we compare the time to converge to the ε-optimal solution for the problems
in the IPPC’08 (Section 7.1). Although Assumption 2.1 does not hold for the triangle tire world
(Section 7.1.3), all problems in this domain are such that: (i) there exists a proper (but not
complete) policy; and (ii) the dead ends are states in which no action is available. Therefore, all
the considered planners can trivially detect when a dead end s_d is reached, in which case V(s_d)
is updated to infinity and the search is restarted. For this experiment, the value assigned to
V(s_d) is 10^5; this value is large enough since V*(s0) < 12n for the triangle tire world problem
of size n. The exploding blocks world problems are not considered because there is no guarantee
they have a closed policy.

This experiment was conducted on a 2.4GHz machine with 16 cores running a 64-bit version
of Linux. The time and memory cutoffs enforced for each planner were 2 hours and 5GB,
respectively. For SSiPP and Labeled-SSiPP, we used LRTDP as ε-Optimal-SSP-Solver and
depth-based short-sighted SSPs for t ∈ {2, 4, 8, 16, 32}. The admissible heuristic used by all
the planners is the classical planning heuristic h_max applied to the all-outcomes determinization
[Teichteil-Konigsbuch et al., 2011].

Table 7.4 presents the results of this experiment as the average and 95% confidence interval
of the ε-convergence time for 50 runs of each planner parametrization. From the 15 problems
of each domain, we only present the results in which at least one planner ε-converged
to the optimal solution. The problems 5' to 8' for blocks world are problems with 8 blocks
obtained by removing blocks b9 and b10 from the original IPPC’08 problems 5 to 8. We generated
these problems since no planner converged to the optimal solution for problems 5 to 8 and
problems 1 to 4 are too small (ε-convergence is reached in about 1s).

The performance difference between SSiPP and Labeled-SSiPP is not significant for small
problems, i.e., blocks world problems 1 to 4, triangle tire world problems 1 and 2 and zeno
travel problems 1 and 2. For the triangle tire world problems 3 and 4, t = 32 is large enough
that the optimal solution is found using a single short-sighted SSP; therefore the performance of
SSiPP and Labeled-SSiPP for t = 32 is equivalent to the LRTDP performance. For the same
problems, when t < 32, Labeled-SSiPP reaches convergence using between 6% and 32% of the
convergence time of SSiPP for the same value of t.

In the triangle tire world, the best parametrization of Labeled-SSiPP is not able to outperform
LRTDP, the best planner in this domain, due to the overhead of building the short-sighted SSPs.
This problem is specific to the triangle tire domain, since there is only one proper policy;
therefore, a planner that prunes improper policies can efficiently focus its search on the single
optimal policy of the triangle tire world problems. For instance, the (s0, 16)-short-sighted SSP
𝕊_{s0,16} associated with problem 4 of the triangle tire world contains 124436 states, and
𝕊_{s0,16} is generated and solved on every iteration of Line 5 of Labeled-SSiPP (Algorithm 5.2),
even after inferring that 𝕊_{s0,16} also contains only one proper policy. As shown in Section
4.1.2, trajectory-based short-sighted SSPs can be used in order to overcome this issue.


Table 7.4: Results of the ε-convergence experiment for the IPPC domains. Each cell represents the average and 95% confidence
interval of the time, in seconds, to converge to the ε-optimal solution using ε = 10^-4. Cells for which ε-convergence is not reached
are marked accordingly. Best performance over all planners (column) is shown in bold font. The h_max heuristic was used by all
planners. Problems 5' to 8' of blocks world are the IPPC’08 problems 5 to 8 without blocks b9 and b10.

For the larger problems of the blocks world (5' to 8'), Labeled-SSiPP obtains a large
improvement over the considered planners and converges in at most 0.93, 0.80 and 0.26 of the
time necessary for SSiPP, LRTDP and FTVI to converge, respectively. Lastly, in the zeno travel
domain, SSiPP and Labeled-SSiPP obtain a similar performance in the small problems, i.e.,
problems 1 and 2, and converge in at most 0.06 of the LRTDP convergence time. Notice that FTVI
fails to converge in all the zeno travel problems and Labeled-SSiPP for t = 32 is the only planner
able to converge for problem 4 of the zeno travel domain.


7.2.2  Race-track  problems 

The goal of a problem in the race-track domain [Barto et al., 1995, Bonet and Geffner, 2003] is
to move a car from its initial location to one of the goal locations, while minimizing the expected
cost of travel. A state in the race-track domain is the tuple (x, y, v_x, v_y, b) in which:

•  x and y are the position of the car in the given 2-D grid (track);

•  v_x and v_y are the velocities in each dimension; and

•  b is a binary variable that is true if the car is broken.

At each time step, the position (x, y) of the car is updated by adding its current velocity
(v_x, v_y) in the respective dimensions. Acceleration actions, represented by pairs
(a_x, a_y) ∈ {−1, 0, 1}² and denoting the instantaneous acceleration in each direction, are
available to control the car’s velocity. An acceleration action (a_x, a_y) can fail with
probability 0.1, in which case the car’s velocity is not changed.

If  the  car  attempts  to  leave  the  race  track,  then  it  is  placed  in  the  last  valid  position  before 
exiting  the  track,  its  velocity  in  both  directions  is  set  to  zero  and  it  is  marked  as  broken,  i.e.,  b  is 
set  to  true.  The  special  action  fix-car  is  used  in  order  to  fix  the  car  (i.e.,  set  b  to  false).  The  cost 
of  fix-car  is  50  while  the  acceleration  actions  have  cost  1. 
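
The dynamics described above can be sketched as a single transition function. Several details
are assumptions made for illustration rather than the encoding used in the experiments: the
acceleration is applied before the position update, a car that would leave the track stays in
its previous cell (a simplification of "last valid position"), and the track is a set of
valid (x, y) cells.

```python
import random

FIX_CAR_COST = 50  # cost of fix-car; acceleration actions have cost 1


def race_track_step(state, accel, track, rng, fail_prob=0.1):
    """Apply an acceleration action to state (x, y, vx, vy, broken)."""
    x, y, vx, vy, broken = state
    ax, ay = accel
    if rng.random() >= fail_prob:  # the acceleration succeeds
        vx, vy = vx + ax, vy + ay
    nx, ny = x + vx, y + vy
    if (nx, ny) not in track:
        # Leaving the track: zero velocity and a broken car.
        return (x, y, 0, 0, True)
    return (nx, ny, vx, vy, False)


def fix_car(state):
    """Repair the car, i.e., set b to false."""
    x, y, vx, vy, _ = state
    return (x, y, vx, vy, False)


track = {(x, y) for x in range(3) for y in range(3)}  # 3x3 toy track
rng = random.Random(0)
# fail_prob=0.0 forces the acceleration to succeed so the example is deterministic.
s1 = race_track_step((0, 0, 0, 0, False), (1, 1), track, rng, fail_prob=0.0)
s2 = race_track_step(s1, (1, 1), track, rng, fail_prob=0.0)
s3 = fix_car(s2)
```

The second step accelerates to velocity (2, 2), which would move the car off the 3x3 track, so
the car is stopped and marked as broken until fix-car is applied.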

We consider six race-tracks in this experiment: ring-small, ring-large, square-small,
square-large, y-small and y-large. The shape of each track is depicted in Figure 7.1, and Table
7.5 presents their corresponding state space size |S|, ratio of relevant states (i.e.,
|S^π*|/|S|), largest parameter t, t_max, for depth-based short-sighted SSPs such that
π*_{s0,t_max} is not closed for the original SSP, and V*(s0).

Figure 7.1: Shape of the race-tracks used in the ε-convergence experiment. Each cell represents
a possible position of the car. The initial position and the goal positions are, respectively, the
marked cells in the bottom and top of each track.

problem     |S|       % rel.   t_max   V*(s0)   h_min(s0)   time h_min(s0)
ring-s      4776      12.91    74      21.85    12.00       0.451
ring-l      75364     14.34    869     36.23    24.00       32.056
square-s    42396     2.01     71      18.26    11.00       14.209
square-l    193756    0.75     272     22.26    13.00       145.616
y-small     101481    10.57    114     29.01    18.00       32.367
y-large     300460    9.42     155     32.81    21.00       211.891

Table 7.5: Description of each race-track used in the ε-convergence experiment. The columns
represent: size of the state space |S|, ratio |S^π*|/|S|, t_max, V*(s0), value of the min-min
heuristic for s0 (h_min(s0)) and time in seconds to compute h_min(s0).

The admissible heuristic used by all the planners is the min-min heuristic h_min, and h_min(s)
equals the cost of the optimal plan for reaching a goal state from s in the all-outcomes
determinization. Therefore, h_min can be computed by the following fixed point equations:

\[
h_{\min}(s) =
\begin{cases}
0 & \text{if } s \in G \\
\min_{a \in A} \; \min_{s' : P(s'|s,a) > 0} \left[ C(s, a, s') + h_{\min}(s') \right] & \text{otherwise}
\end{cases}
\]
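
A straightforward way to solve these fixed-point equations is to relax the values repeatedly
until nothing changes, as in the sketch below. The dictionary encoding of P and C and the name
min_min_heuristic are illustrative assumptions, costs are assumed non-negative, and the
implementation used in the experiments may differ (e.g., a Dijkstra-style backward search).

```python
import math


def min_min_heuristic(states, goals, P, C):
    """Compute h_min for every state by iterating the fixed-point equations."""
    h = {s: (0.0 if s in goals else math.inf) for s in states}
    changed = True
    while changed:
        changed = False
        for (s, a), outcomes in P.items():
            if s in goals:
                continue
            for s_next, prob in outcomes.items():
                if prob > 0:  # every possible outcome counts, whatever its probability
                    candidate = C[(s, a, s_next)] + h[s_next]
                    if candidate < h[s]:
                        h[s] = candidate
                        changed = True
    return h


# Toy SSP: a cheap two-step route and an expensive, very unlikely shortcut.
states = {"s0", "s1", "g"}
P = {("s0", "a"): {"s1": 0.9, "s0": 0.1},
     ("s1", "b"): {"g": 0.5, "s1": 0.5},
     ("s0", "c"): {"g": 0.01, "s0": 0.99}}
C = {("s0", "a", "s1"): 1.0, ("s0", "a", "s0"): 1.0,
     ("s1", "b", "g"): 2.0, ("s1", "b", "s1"): 2.0,
     ("s0", "c", "g"): 5.0, ("s0", "c", "s0"): 5.0}
h = min_min_heuristic(states, {"g"}, P, C)
```

In the toy problem h_min(s0) is 3 (action a followed by b), a lower bound on the true expected
cost because the heuristic assumes that the best outcome of each action always occurs.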


This experiment was conducted on a 3.07GHz machine with 4 cores running a 32-bit version
of Linux. A time cutoff of 2 hours and a memory cutoff of 4GB were applied to each planner.
For SSiPP and Labeled-SSiPP, we used LRTDP as ε-Optimal-SSP-Solver and depth-based
short-sighted SSPs for t ∈ {4, 8, 16, ..., 1024}. FTVI is not considered in this experiment
because the implementation of FTVI we had access to is not compatible with the encoding of the
race-track problems. Table 7.6 presents the results as the average and 95% confidence interval for
10 runs of each planner parametrization.

The performance of SSiPP, Labeled-SSiPP and LRTDP is similar for t > t_max in all the
problems since LRTDP is used as ε-Optimal-SSP-Solver and t_max is such that, for t > t_max, the
short-sighted SSP contains all the states necessary to find the optimal solution. The performance
improvement of Labeled-SSiPP over SSiPP is more evident for smaller values of t and, as t
approaches t_max, it decreases until both Labeled-SSiPP and SSiPP converge to the LRTDP
performance.

For the square and y tracks, the best performance is obtained by Labeled-SSiPP for t equal to
either 64 (small tracks) or 128 (large tracks), both values smaller than t_max for their
respective problems. While the 95% confidence intervals of Labeled-SSiPP and LRTDP overlap for
the y tracks, the improvement is statistically significant for the square tracks, especially
for the large instance: 612.78 ± 30.44 (Labeled-SSiPP) versus 702.42 ± 12.82 (LRTDP).
This difference in performance is because the optimal policy in the square-large track reaches
only 0.75% of the state space (Table 7.5). Therefore both SSiPP and Labeled-SSiPP take
advantage of the short-sighted search to prune useless states earlier in the search, resulting in
a better performance than LRTDP for t ∈ {32, 64, 128, 256}.

In this section, we compare the performance of the following planners to obtain (sub-optimal)
solutions under a 20-minute time cutoff:

•  FF-Replan  [Yoon  et  al.,  2007]  (winner  of  IPPC’04), 

•  Robust-FF  [Teichteil-Koenigsbuch  et  al.,  2008]  (winner  of  IPPC’08), 

•  HMDPP  [Keyder  and  Geffner,  2008], 

•  ReTrASE  [Kolobov  et  al.,  2009], 

•  SSiPP, 

•  Labeled-SSiPP,  and 

•  SSiPP-FF. 

The non-SSiPP planners are reviewed in Chapter 6 and, for these experiments, we use 15
problems from IPPC’08 of each domain described in Section 7.1. We present the methodology used
in this experiment in Section 7.3.1. In Section 7.3.2, we describe heuristics to choose the
parameters of SSiPP, Labeled-SSiPP and SSiPP-FF, i.e., the value of t for depth-based short-sighted
SSPs. Section 7.3.3 presents the results of this experiment.


7.3.1 Methodology

We use a methodology similar to the IPPCs, in which there is a time cutoff for each individual
problem: a planner has 20 minutes to compute a policy and simulate the computed policy 50
times from the initial state s0. A round is each simulation from s0 of the same problem, and
rounds are simulated in a client/server approach using MDPSIM [Younes et al., 2005], an SSP
(and MDP) simulator. Planners send actions to be simulated to MDPSIM, and MDPSIM
internally simulates the received actions and returns the resulting state. Every round terminates when
either: (i) the goal is reached; (ii) an invalid action, e.g., not applicable in the current state, is sent
to MDPSIM; (iii) 2000 actions have been submitted to MDPSIM; or (iv) the planner explicitly
gives up on the round, e.g., because it inferred that it is trapped in a dead end. A round is
considered successful if the goal is reached; otherwise it is declared a failed round. Notice
that planners are allowed to change their policies at any time, i.e., during a round or in between
rounds. Therefore, the knowledge obtained from one round, e.g., the lower bound on V*(s0),
can be used to solve subsequent rounds.
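
The four termination conditions of a round can be sketched as a small simulation loop. The sketch below is only an illustration in Python and is not the MDPSIM protocol: the callbacks choose_action, applicable, step and is_goal are hypothetical stand-ins for the planner and the simulator, and the toy problem at the end is made up.

```python
import random
from enum import Enum


class RoundResult(Enum):
    GOAL = "goal reached"
    INVALID_ACTION = "invalid action"
    ACTION_LIMIT = "action limit reached"
    GAVE_UP = "planner gave up"


MAX_ACTIONS = 2000  # limit on submitted actions per round, as stated in the text


def run_round(s0, choose_action, applicable, step, is_goal, rng):
    """Simulate one round from s0 and report why it terminated.

    All callbacks are hypothetical stand-ins (assumptions of this sketch):
      choose_action(s) -> an action, or None when the planner gives up
      applicable(s, a) -> True if action a is valid in state s
      step(s, a, rng)  -> sampled successor state
      is_goal(s)       -> True if s is a goal state
    """
    s = s0
    for _ in range(MAX_ACTIONS):
        if is_goal(s):                      # condition (i)
            return RoundResult.GOAL
        a = choose_action(s)
        if a is None:                       # condition (iv)
            return RoundResult.GAVE_UP
        if not applicable(s, a):            # condition (ii)
            return RoundResult.INVALID_ACTION
        s = step(s, a, rng)
    # condition (iii): the action budget is exhausted
    return RoundResult.GOAL if is_goal(s) else RoundResult.ACTION_LIMIT


# Toy usage: states are integers, the goal is state 3, and the single action
# "fwd" advances with probability 0.9 (a made-up problem for illustration).
rng = random.Random(0)
result = run_round(
    s0=0,
    choose_action=lambda s: "fwd",
    applicable=lambda s, a: a == "fwd",
    step=lambda s, a, r: s + 1 if r.random() < 0.9 else s,
    is_goal=lambda s: s == 3,
    rng=rng,
)
print(result.value)
```

Only the first outcome counts as a successful round; the other three are failed rounds.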

A run is the sequence of rounds simulated by a planner for a given problem, and the previous
IPPCs evaluate planners based on a single run per problem. Due to the stochastic nature of SSPs,
the outcome of a single run depends on the random seed used in the initialization of both the
planner and MDPSIM. In order to evaluate planners more accurately, we execute 50 runs for
each problem and planner, and no information is shared between the different runs, i.e., all the
internal variables of the planners are reset when a new run starts. Therefore, in this section,
the performance of a planner in a given problem is estimated by 2500 rounds generated by
potentially 50 different policies computed by the same planner. Notice that our approach (50
runs of 50 rounds each) is not equivalent to the execution of one run of 2500 rounds. In the
latter case, a planner might be guided towards bad decisions by the outcomes of the probabilistic
actions and not have enough time to revise such decisions. In contrast, when several runs are
simulated, the probability that this misguidance happens in all the runs is small.

In order to respect the 20-minute time cutoff, SSiPP, Labeled-SSiPP and SSiPP-FF solve
rounds internally for 15 minutes and then start solving rounds through MDPSIM. For SSiPP
and SSiPP-FF, a round is solved internally by calling Algorithms 3.2 and 5.5, respectively, and
the obtained lower bound V in round r is used as heuristic for round r + 1. The same effect is
obtained for Labeled-SSiPP by adding a 15-minute time cutoff in Line 5 of Algorithm 5.2.

The IPPCs also enforce that planners must not have free parameters, i.e., the only input for
each planner is the problem to be solved. Therefore, all parameters of a planner, e.g., the value
of t and heuristic for SSiPP, must be fixed a priori or automatically derived. Because of this
rule, none of the non-SSiPP planners considered has free parameters. In the IPPC’08, two different
parametrizations were fixed for Robust-FF and we consider only the RFF-PG parametrization,
since it obtained the best performance in IPPC’08 for the considered problems [Bryce and Buffet,
2008]. Section 7.3.2 describes the two different methods we employed to obtain the
parametrizations for SSiPP, Labeled-SSiPP and SSiPP-FF.


7.3.2  Choosing  the  value  of  t  and  heuristic  for  SSiPP-based  planners 

In order to choose a fixed parametrization for SSiPP, Labeled-SSiPP and SSiPP-FF, i.e., a value
of t and a heuristic, we perform a round-robin tournament between different parametrizations
of each planner. The round-robin tournament consists of comparing the performance of
different parametrizations of a planner in the 15 final problems from IPPC’06 for blocks world,
zeno travel, and exploding blocks world. While these three domains are the same between
IPPC’06 and IPPC’08, their final problems are different. No problem from the triangle tire
world is used for training, since they are deterministically generated, i.e., any triangle tire world
of size {1, ..., 15} would be exactly the same as the problems in the main experiment. We refer
to these 45 problems as the set of training problems P.

Formally, given a planner X and a set of parametrizations K = {k1, ..., kn} for X, we
solve all problems in P using the same methodology as described in Section 7.3.1. We denote as
c(ki, p) the number of rounds of the problem p ∈ P in which X, using parametrization ki ∈ K,
reached the goal. The function m(ki, kj) represents the tournament bracket between ki and kj,
and m(ki, kj) equals 1 if

$$ |\{p \in P \mid c(k_i, p) > c(k_j, p)\}| \;>\; |\{p \in P \mid c(k_i, p) < c(k_j, p)\}|, $$

i.e., if ki outperforms kj in most of the problems, and 0 otherwise. The tournament winner
is the parametrization k that outperforms the majority of other parametrizations in K, that is,

$$ k = \operatorname{argmax}_{k_i \in K} \sum_{k_j \in K} m(k_i, k_j). $$
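
The bracket function m and the selection of the tournament winner can be sketched as follows. This is a minimal Python illustration and not the code used in the experiments: the nested dictionary c, the parametrization labels and the round counts are made up, and ties between parametrizations are broken by taking the first one, which is an assumption of this sketch.

```python
def bracket(c, ki, kj, problems):
    """m(ki, kj): 1 if ki beats kj in more problems than kj beats ki, else 0."""
    wins = sum(1 for p in problems if c[ki][p] > c[kj][p])
    losses = sum(1 for p in problems if c[ki][p] < c[kj][p])
    return 1 if wins > losses else 0


def tournament_winner(c, problems):
    """Return the parametrization that wins the most brackets, and all scores.

    c[k][p] is the number of successful rounds of parametrization k on
    problem p. Ties are broken by dictionary order (assumption of this sketch).
    """
    params = list(c)
    score = {
        ki: sum(bracket(c, ki, kj, problems) for kj in params if kj != ki)
        for ki in params
    }
    return max(params, key=lambda k: score[k]), score


# Made-up round counts for three hypothetical parametrizations on three problems.
problems = ["p1", "p2", "p3"]
c = {
    "t=2": {"p1": 10, "p2": 20, "p3": 5},
    "t=4": {"p1": 30, "p2": 25, "p3": 5},
    "t=8": {"p1": 40, "p2": 10, "p3": 50},
}
winner, score = tournament_winner(c, problems)
print(winner, score)
```

Note that the comparison counts problems won rather than total rounds solved, so a parametrization with a large margin on a single problem does not dominate the tournament.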

For  SSiPP  and  Labeled-SSiPP,  the  set  of  considered  parametrizations  K  is  the  cross  product 
of  T  =  {2,3,4,...,  10}  and  the  following  set  H  of  heuristics: 

• zero-heuristic: h0(s) = 0 for all s ∈ S;

• FF-heuristic: hff(s) equals the cost of the plan returned by the deterministic planner FF
[Hoffmann and Nebel, 2001] to reach a goal state from the current state s in the
all-outcomes determinization; and

• hmax and hadd applied to the all-outcomes determinization of the original problem
[Teichteil-Konigsbuch et al., 2011].

For SSiPP-FF, the determinization type is also a parameter and its set of considered
parametrizations K equals T × H × {most-likely outcome, all-outcomes}. The parametrization that won
the round-robin tournament for each SSiPP-based planner in their respective set of considered
parameters K is: t = 3 and hadd for SSiPP; t = 6 and hadd for Labeled-SSiPP; and t = 3, hadd and
the all-outcomes determinization for SSiPP-FF. We refer to these parametrizations as SSiPPt,
Labeled-SSiPPt and SSiPP-FFt.

We also consider an approach in which the value of t is randomly selected for SSiPP,
Labeled-SSiPP and SSiPP-FF. Formally, we select t at random from {2, 3, 4, ..., 10} before calling
Generate-Short-Sighted-SSP in Algorithms 3.2, 5.2 and 5.5. Therefore, different values
of t might be used for solving a given problem. For this approach, we use hadd as heuristic for all
the SSiPP-based planners and the all-outcomes determinization for SSiPP-FF. Also, in order to
avoid generating large short-sighted SSPs, we stop Generate-Short-Sighted-SSP after 15
seconds or if |Ss,t| > 10^5. When Generate-Short-Sighted-SSP is interrupted, the states
that could not be explored are marked as artificial goals. We refer to these parametrizations as
SSiPPr, Labeled-SSiPPr and SSiPP-FFr.


7.3.3  Results 

This experiment was conducted on a 2.4GHz machine with 16 cores running a 64-bit version of
Linux. We use coverage, i.e., the ratio between the number of successful rounds and 2500 (the
total number of rounds), as the performance metric. Table 7.7 presents the summary of the results as
the number of problems in which a given planner has the best coverage. The detailed results are
presented in Tables 7.8 and 7.9 as the coverage obtained by each planner in every problem, and
in Tables 7.10 and 7.11 as the average and 95% confidence interval for the obtained cost over the
successful rounds for each problem.
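
The two reported statistics can be computed as in the following Python sketch. It is an illustration under stated assumptions: the interval uses the normal approximation with the 1.96 quantile, which may differ from the interval construction used for the tables, and the half-width is reported as infinite when exactly one round is successful.

```python
import math
import statistics


def coverage(round_succeeded):
    """Fraction of successful rounds, e.g., over the 2500 rounds of a problem."""
    return sum(round_succeeded) / len(round_succeeded)


def mean_and_ci95(costs):
    """Average cost over the successful rounds and the 95% half-width.

    Returns None when no round is solved, and an infinite half-width when
    exactly one round is solved. The normal approximation (1.96 standard
    errors) is an assumption of this sketch.
    """
    n = len(costs)
    if n == 0:
        return None
    mean = statistics.fmean(costs)
    if n == 1:
        return mean, math.inf
    half_width = 1.96 * statistics.stdev(costs) / math.sqrt(n)
    return mean, half_width


# Made-up outcomes: 2000 successful rounds out of 2500.
outcomes = [True] * 2000 + [False] * 500
print(coverage(outcomes))
print(mean_and_ci95([2.0, 4.0]))
```

Because the cost statistics only consider successful rounds, a planner with low coverage can still report a small average cost, so both metrics are read together.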

SSiPP-FFt and SSiPP-FFr successfully take advantage of determinizations and improve
the coverage obtained by SSiPP and Labeled-SSiPP in the domains without dead ends, i.e.,
blocks world and zeno travel. In particular, both parametrizations of SSiPP-FF, together with
FF-Replan, are the only planners able to solve the medium and large problems of the zeno travel
domain. SSiPP-FF also improves on the performance of FF-Replan for problems with dead ends. In
the triangle tire world, a problem designed to penalize determinization approaches, FF-Replan,
SSiPP-FFt and SSiPP-FFr solve instances up to number 5, 7 and 9, respectively; moreover, the
coverage of SSiPP-FFr is more than double the coverage of FF-Replan for problems 1 to 5.

               Blocks   Zeno     Triangle   Exploding
               World    Travel   Tire W.    Blocks W.
FF-Replan        13       15        0           1
Robust-FF         8        0        4           1
HMDPP             4        2       13           1
ReTrASE           8      n.a.       4           1
SSiPPt            4        0        1           2
SSiPPr            4        2        2           8
L-SSiPPt          5        2        2           2
L-SSiPPr          5        2        2           3
SSiPP-FFt         8       11        0           2
SSiPP-FFr         8       13        0           7

Table 7.7: Summary of the IPPC experiment. Each cell represents the number of problems per
domain in which a given planner has the best coverage. For each problem, more than one planner
might obtain the best coverage, therefore the columns do not add up to 15. ReTrASE does not
support the zeno travel problems (n.a.).


In the exploding blocks world, the combination of SSiPP and determinizations is especially
useful for large instances: SSiPP-FFr is the planner with the best coverage for the 5 largest
problems in this domain. The solution quality of FF-Replan is also improved by SSiPP-FF. For
instance, in zeno travel problems 1 to 10 and 12, i.e., all the problems in which SSiPP-FF
obtained coverage 1, the solutions found by SSiPP-FFt and SSiPP-FFr have an average cost between
0.80 and 0.92 of the average cost of the FF-Replan solutions.

Labeled-SSiPP performs well in the small problems, obtaining good coverage and solutions
with small average cost; however, Labeled-SSiPP fails to scale up to large problems. The reason
for not scaling up is the bias for exploration over exploitation employed by Labeled-SSiPP in
order to speed up the convergence to the ε-optimal solution.

All SSiPP-based planners perform well in the exploding blocks world: SSiPPr has the best
coverage in 9 of the problems; SSiPP-FFr has the best coverage in the 5 largest problems; and,
for all the considered problems in the exploding blocks world, an SSiPP-based planner has the
best coverage.

The performance in the triangle tire world problems is dominated by HMDPP. In this domain,
the chosen parametrizations of SSiPP, Labeled-SSiPP, and SSiPP-FF do not perform as well as
HMDPP or ReTrASE because their parametrizations use hadd as heuristic. In the triangle tire
world, hadd guides the planners towards dead ends, and the SSiPP-based planners manage to avoid only
the dead ends visible inside the short-sighted SSPs. As shown in Section 4.1.2, SSiPP performs
the best in the triangle tire domain when the zero-heuristic is used and Table 7.12 shows the

Table 7.8: Coverage for the blocks world and zeno travel domains in the IPPC experiment. Best coverage for each problem (row) is
shown in bold. If no round is solved, i.e., zero coverage, then ’-’ is shown. ReTrASE does not support the zeno travel problems (n.a.).


Table 7.9: Coverage for the triangle tire world and exploding blocks world domains in the IPPC experiment. Best coverage for each
problem (row) is shown in bold. If no round is solved, i.e., zero coverage, then ’-’ is shown.

Table 7.10: Cost of the solutions for the blocks world and zeno travel domains in the IPPC experiment. Each cell represents the average
and 95% confidence interval for the obtained cost over the successful rounds. If no round is solved, then ’-’ is shown; if exactly one
round is solved, then ∞ is shown in the 95% confidence interval. ReTrASE does not support the zeno travel problems (n.a.).


Table 7.11: Cost of the solutions for the triangle tire world and exploding blocks world domains in the IPPC experiment. Each cell represents
the average and 95% confidence interval for the obtained cost over the successful rounds. If no round is solved, then ’-’ is shown; if
exactly one round is solved, then ∞ is shown in the 95% confidence interval.

Triangle Tire World

Problem   SSiPP   L-SSiPP   SSiPP-FF
   1      1.000    1.000     1.000
   2      1.000    1.000     1.000
   3      0.997    1.000     0.533
   4      0.977    1.000     0.162
   5      0.963    1.000     0.082
   6      0.950    1.000     0.049
   7      0.913    1.000     0.023
   8      0.870    0.868     0.015
   9      0.882    0.798     0.003
  10      0.842    0.767       -
  11      0.773    0.717       -
  12      0.738    0.633       -
  13      0.717    0.595       -
  14      0.685    0.518       -
  15      0.617    0.422       -

Table 7.12: Coverage of SSiPP-based planners in the triangle tire world using depth-based
short-sighted SSPs and the zero-heuristic. For all planners, the parameter t equals 8
and, for SSiPP-FF, the all-outcomes determinization is used. Best coverage for each problem
(row), with respect to the results in Tables 7.8 and 7.9, is shown in bold. If no round is solved,
then ’-’ is shown.


performance of SSiPP, Labeled-SSiPP, and SSiPP-FF using the parametrization t = 8 and the
zero-heuristic (for SSiPP-FF, the all-outcomes determinization is used). For these
parametrizations, the coverage obtained by SSiPP, Labeled-SSiPP, and SSiPP-FF is significantly improved:
Labeled-SSiPP solved all the rounds for the problems 1 to 7; and SSiPP has the best coverage
for the 3 largest problems in comparison with all the considered planners.


7.4  Summary 

In this chapter, we presented a rich empirical comparison between the proposed algorithms and
other state-of-the-art algorithms in two tasks: finding an ε-optimal solution and finding a
(sub-optimal) solution under the International Probabilistic Planning Competition (IPPC) rules, e.g.,
small time cutoff. The results from the ε-convergence experiments showed that Labeled-SSiPP,
using LRTDP as underlying SSP solver, outperforms SSiPP, LRTDP and FTVI on problems from
the IPPC and on control problems with a low ratio of relevant states, i.e., |S^π*|/|S|. The results
obtained in the experiment following the IPPC rules show that SSiPP-FF successfully combines
the behavior of SSiPP and FF-Replan by having a large coverage in problems without dead ends
and significantly improving the coverage of FF-Replan in problems with dead ends. These
results also show that SSiPP and SSiPP-FF consistently outperform the other planners in all the
problems of the exploding blocks world, a probabilistically interesting domain.


Chapter  8 

A  Real  World  Application:  a  Service  Robot 
Searching  for  Objects 


In this chapter, we present how a mobile service robot moving in a building in order to find an
object, whose location is not deterministically known, can use short-sighted planning to improve
its performance. We begin by motivating the mobile service robot problem and, in Section 8.2, we
formally present how to represent this problem as an SSP. In Section 8.3, we empirically evaluate
different planners, including SSiPP, in different instances of the mobile service robot problem.

8.1  Motivation 

The problem of an autonomous agent moving in an environment to find objects while minimizing
the search cost is ubiquitous in the real world, e.g., a taxi driver looking for passengers while
minimizing the usage of gas, a software agent finding information about a product on the web while
minimizing the bandwidth usage, a service robot bringing objects to users while minimizing the distance
traversed, and a robot collecting rocks for experiments while minimizing power consumption. In
all these problems, we assume that the agent does not know exactly where the objects are, but has
a probabilistic model of the location of the objects.

For this chapter, our concrete motivation is the mobile service robot that moves in a building
to find an object, e.g., coffee, and to deliver it to a location, e.g., office #171. We assume that
the robot is given a map of the environment and that the object can be in more than one location.
Also, we consider that the probability of the object being at a location type, e.g., offices, is given.
Such a prior distribution can be designed by an expert or automatically obtained, for example by
querying the web (e.g., [Samadi et al., 2012]). In particular, we focus on the problem of finding
the desired object, since the delivery problem can be cast as the problem of finding an object that
is deterministically present only in the delivery location. In the next section, we present how to
represent the problem of finding a given object as an SSP.


8.2  Representing  the  Problem  as  an  SSP 

In this section, we present our formulation of the problem of finding an object in a building as an
SSP represented in PPDDL (Section 2.2). For this representation, we use one domain variable,
location, that describes the locations the agent is allowed to visit, and the following predicates
defined over locations:

• connected(l1, l2): true when the agent can move from location l1 to l2;

• at(l): to represent the agent’s current location;

• objAt(l): to denote that an instance of the object being searched for is at l;

• searched(l): to indicate that l has already been searched;

• and a set of predicates to denote the type of each location, e.g., isOffice(l) for office
locations and isKitchen(l) for kitchens.

Also, we use the state variable hasObject to indicate that the agent has the desired object.

For each location type t, we use the binary random variable Xt to denote if the object is
at the locations of type t and we assume that a prior probability P(Xt) is given. Note that
the values P(Xt = true) are not required to sum to 1 over the location types. This feature is
used for representing scenarios such as an object that can be found deterministically in more
than one location type or an object that has a low probability to be found in any location type.
To simplify notation, we denote P(Xt = true) as pt for every location type t.

We model the object finding through a pair of action schemas, Search and PickUp. The
action Search(l), depicted in Figure 8.1, has the precondition that the agent is at location l and
l has not been searched before. Its effect is searched(l), i.e., to mark l as searched, and, with
probability pt, where t is the location type of l, the object is found. With probability 1 − pt,
the object is not found at l. Since searched(l) is true after the execution of Search(l), the
agent cannot search the same location l more than once. We enforce this restriction because
(1 − pt)^k → 0 as k → ∞ for pt > 0, i.e., if the agent were allowed to search the same location
enough times it would always find the object there.
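
To make the limit concrete, the following Python sketch computes how quickly (1 − pt)^k vanishes if repeated searches of the same location were allowed and each search were an independent draw with probability pt. The value pt = 0.02 and the 1% threshold are chosen here only for illustration.

```python
def prob_still_not_found(p_t, k):
    """Probability that k independent searches of one location all fail."""
    return (1.0 - p_t) ** k


def searches_until_below(p_t, threshold):
    """Smallest k such that (1 - p_t)^k <= threshold, for 0 < p_t <= 1."""
    k = 0
    prob = 1.0
    while prob > threshold:
        prob *= 1.0 - p_t
        k += 1
    return k


# Even with a prior as low as 0.02 (value chosen for illustration),
# repeated searches would eventually make a failure negligible.
for k in (1, 10, 100, 1000):
    print(k, prob_still_not_found(0.02, k))
print(searches_until_below(0.02, 0.01))
```

This is why the searched(l) predicate is needed: without it, the prior pt would stop acting as the probability that the object is at l and would only determine how many repetitions are needed.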

The action PickUp(l), depicted in Figure 8.2, represents the agent obtaining the object at
location l if the object is there. This action can be easily extended to encompass more general
scenarios, e.g., a robotic agent with grippers that can fail and the object might not always be

(:action Search
  :parameters (?l - location)
  :precondition (and (at ?l) (not (searched ?l)))
  :effect (and
    (searched ?l)
    (when (isBathroom ?l) (prob 0.08 (objAt ?l)))
    (when (isKitchen ?l) (prob 0.18 (objAt ?l)))
    (when (isOffice ?l) (prob 0.02 (objAt ?l)))
    (when (isPrinterR ?l) (prob 0.72 (objAt ?l)))))


Figure 8.1: PPDDL code for the action Search(l) of the service robot problem. For this action,
the prior used for the object being at a location l is 8%, 18%, 2% and 72% if l is, respectively, a
bathroom, a kitchen, an office or a printer room.


(:action PickUp
  :parameters (?l - location)
  :precondition (and (at ?l) (objAt ?l))
  :effect (and
    (not (objAt ?l))
    (hasObject)))


Figure 8.2: PPDDL code for the action PickUp(l) of the service robot problem.


obtained, or a symbiotic autonomous agent that might ask people for help to manipulate the
object [Rosenthal et al., 2010]. Such extensions can be modeled by converting PickUp(l) into a
probabilistic action or a chain of probabilistic actions.

We use the action schema Move to model the agent moving in the map represented by the
predicate connected(l1, l2). The action Move(l1, l2) is probabilistic: with probability p the
agent moves from l1 to l2 and with probability 1 − p the agent stays at l1. For all the examples
and experiments in this chapter, we use p = 0.9.

Initially, the value of the state variable hasObject is false and the goal of the agent is to reach
any state in which hasObject is true. For ease of presentation, we define the cost of all actions
to be 1, i.e., C(s, a, s') = 1 ∀s ∈ S, a ∈ A, s' ∈ S. Therefore, the average cost of reaching the
goal equals the average number of actions applied by the agent.
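
Under unit costs and a Move action that succeeds with probability p, the number of Move actions needed to advance one location is geometrically distributed with mean 1/p, so traversing n connected locations costs n/p actions on average. The Python sketch below checks this with a small simulation; the function names and the 10-hop corridor are hypothetical and used only for illustration.

```python
import random


def expected_move_cost(hops, p=0.9):
    """Expected number of unit-cost Move actions to advance `hops` locations."""
    return hops / p


def simulate_move_cost(hops, p, rng):
    """Sample the number of Move actions needed to advance `hops` locations."""
    actions = 0
    remaining = hops
    while remaining > 0:
        actions += 1                 # every attempt costs 1
        if rng.random() < p:         # the move succeeds with probability p
            remaining -= 1
    return actions


rng = random.Random(42)
trials = 20000
average = sum(simulate_move_cost(10, 0.9, rng) for _ in range(trials)) / trials
print(expected_move_cost(10), average)
```

With p = 0.9, a corridor of 10 locations therefore costs about 11.1 actions on average rather than 10.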

To illustrate our model, consider the map presented in Figure 8.3(a). In this map, the agent
is at position 0 and there are two hallways that can be explored: (i) the right hallway of size
k in which the last location is a kitchen; and (ii) the left hallway with 2r offices. Notice that
Figure 8.3(a) represents only the map of the environment and not the search space. A fraction of
the search space is depicted in Figure 8.3(b).

[Figure 8.3, panel (a) Map: r office rows, with a zoom on one office row; panel (b) Search Space: states such as at(0), at(4), (at(4),searched(4),objAt(4)) and (at(4),searched(4),hasObj)]


Figure 8.3: Example of map and state space of the service robot problem. (a) Example of map
representing a building. The agent is initially at location 0. Gray cells represent offices, the dark
blue cell represents the kitchen and white cells represent the hallways. (b) Visualization of the
initial portion of the search space for the map on (a). Arrows depict actions: arrows with self-loop
represent the action Move, and gray arrows represent either Search or PickUp. Using the closed-world
assumption, any state variable not presented in (b) is considered false. State (at(4),searched(4),hasObj)
is a goal state.

                           Location
Object    Bathroom   Kitchen   Office   Printer Room
coffee      0.08      0.72*     0.18       0.02
cup         0.42*     0.36      0.12       0.10
papers      0.00      0.13      0.70*      0.17
pen         0.15      0.23      0.35*      0.27
toner       0.05      0.02      0.06       0.87*

Table 8.1: Prior probability used in our service robot experiments. These probabilities, obtained
using ObjectEval [Samadi et al., 2012], represent the probability of the object being in a given
location type. The mode of each prior is marked with *.


In order to show the effects of each parameter in the solution of the problem, consider the
policies πj, for j ∈ {0, ..., r}, in which the agent explores the first j office rows, then explores
the kitchen and finally the remaining r − j office rows. For all πj, the exploration stops once the
object is found. For instance, if poffice = 1, then the only policy that explores the kitchen is π0,
since no office is explored before the kitchen, and all other policies stop exploring after the first
office is visited.

Figure 8.4 shows the average cost of following the policies πj from location 0 in the map
from Figure 8.3(a). Each plot of Figure 8.4 varies either k, r, pkitchen or poffice while fixing the
other parameters to k = 10, r = 10, pkitchen = 0.9 and poffice = 0.1. Figure 8.4(c) shows that the
average cost of πj is exponential in r, since the cost depends on the probability of not
finding the object in a sequence of i offices, i.e., (1 − poffice)^i. Also, the
optimal policy, i.e., the lowest πj at any point of the plots, is either exploring the kitchen first (π0)
or all the offices first (πr) for this example.
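
The dependence of the average cost on the factors (1 − p)^i can be illustrated with a short Python sketch that evaluates a fixed visiting order. It is a simplified model and not the computation behind Figure 8.4: it assumes a single corridor with the kitchen k locations to one side of the start and one office per location on the other side, which differs from the layout of Figure 8.3(a); it counts unit-cost Move, Search and PickUp actions until the object is picked up or every stop has been searched; and all function names are hypothetical.

```python
def expected_search_cost(stops, p_move=0.9):
    """Expected number of actions of a fixed visiting order.

    stops: list of (hops, p_find) pairs in visiting order, where hops is the
    number of locations to traverse from the previous stop and p_find is the
    prior of finding the object at this stop.
    """
    cost = 0.0
    p_searching = 1.0                 # probability the object is not found yet
    for hops, p_find in stops:
        # travel (hops / p_move expected Move actions) plus one Search
        cost += p_searching * (hops / p_move + 1.0)
        # one PickUp if the object is found at this stop
        cost += p_searching * p_find
        p_searching *= 1.0 - p_find
    return cost


def kitchen_first(k, n_offices, p_kitchen, p_office, p_move=0.9):
    """Kitchen first, then the offices (simplified corridor, see lead-in)."""
    stops = [(k, p_kitchen)]
    stops += [(k + 1 if i == 0 else 1, p_office) for i in range(n_offices)]
    return expected_search_cost(stops, p_move)


def offices_first(k, n_offices, p_kitchen, p_office, p_move=0.9):
    """All offices first, then the kitchen (simplified corridor)."""
    stops = [(1, p_office)] * n_offices + [(n_offices + k, p_kitchen)]
    return expected_search_cost(stops, p_move)


print(kitchen_first(2, 4, 0.9, 0.1), offices_first(2, 4, 0.9, 0.1))
```

In this simplified setting, each additional office multiplies the weight of every later stop by (1 − poffice), which is the source of the exponential behavior discussed above.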


8.3  Experiments 

We present five different experiments, each of them for a different object over the same map.
The objects considered in the experiments are: coffee, cup, papers, pen and toner. The prior
distribution of the objects for each location type (Table 8.1) is obtained using ObjectEval [Samadi
et al., 2012], a system that infers this information using the web. Also, we consider that the object
is never in the hallways, i.e., phallway = 0.

For all the experiments, we consider the map depicted in Figure 8.5. The graph representing
this map contains 126 edges and 121 nodes, i.e., locations: 2 bathrooms, 2 kitchens, 59 offices,
1 printer room and 57 segments of hallway. Since there is no location in which any of the
considered objects can be found with probability 1, with positive probability, the object

Figure 8.4: Average cost of the policies πj in the map depicted in Figure 8.3(a). The parameters
used are: k = 10, r = 10, pkitchen = 0.9 and poffice = 0.1. In each plot, one of the four parameters
is varied in the x-axis. In all plots, the best policy (bottom curve) is either π0 (explore the kitchen
and then the offices) or πr (explore the offices and then the kitchen). In plot (c), the policy πr
varies as a function of r, the x-axis, and is depicted in gray for clarity.

Legend (location type): • Hallway, □ Office, B Bathroom, K Kitchen, P Printer Room

Figure 8.5: Floor plan used in our service robot experiments. The embedded graph represents
the map given to the planners. The initial locations for the experiments are represented by the
numbers 1, ..., 10.


might not be found after visiting all locations. This probability is approximately 5 × 10^-7,
6 × 10^-5, 9 × 10^-32, 2 × 10^-12 and 3 × 10^-3 for coffee, cup, papers, pen and toner, respectively.
The simulations in which this low-probability event happens are ignored and rerun.
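
These probabilities follow from the priors in Table 8.1 and the number of locations of each type: the object is found nowhere when every search fails, i.e., with probability equal to the product of (1 − pt) over all searchable locations. The Python sketch below assumes that searches at different locations are independent and that hallways contribute a factor of 1 since phallway = 0; the dictionary names are hypothetical.

```python
# Number of locations of each searchable type in the map of Figure 8.5.
LOCATION_COUNTS = {"bathroom": 2, "kitchen": 2, "office": 59, "printer_room": 1}

# Priors from Table 8.1.
PRIORS = {
    "coffee": {"bathroom": 0.08, "kitchen": 0.72, "office": 0.18, "printer_room": 0.02},
    "cup":    {"bathroom": 0.42, "kitchen": 0.36, "office": 0.12, "printer_room": 0.10},
    "papers": {"bathroom": 0.00, "kitchen": 0.13, "office": 0.70, "printer_room": 0.17},
    "pen":    {"bathroom": 0.15, "kitchen": 0.23, "office": 0.35, "printer_room": 0.27},
    "toner":  {"bathroom": 0.05, "kitchen": 0.02, "office": 0.06, "printer_room": 0.87},
}


def prob_object_nowhere(prior, counts):
    """Probability that every search in the building fails.

    Assumes independent searches; hallways are omitted because their prior
    is 0 and they would only contribute a factor of 1.
    """
    prob = 1.0
    for location_type, n in counts.items():
        prob *= (1.0 - prior[location_type]) ** n
    return prob


for obj, prior in PRIORS.items():
    print(obj, prob_object_nowhere(prior, LOCATION_COUNTS))
```

The computed values agree in order of magnitude with the approximate probabilities quoted above.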

The planners considered in the experiments are FF-Replan (Algorithm 2.2), UCT
(Section 6.4) and SSiPP (Algorithm 3.2). For the latter two, we use the FF-heuristic hff: for a
given state s, hff(s) equals the number of actions in the plan returned by FF using s as initial
state and the all-outcomes determinization. For UCT, we considered 12 different
parametrizations obtained by using the bias parameter c ∈ {1, 2, 4, 8} and the number of samples per decision
w ∈ {10, 100, 1000}. For SSiPP, we used LRTDP as ε-Optimal-SSP-Solver and depth-based
short-sighted SSPs for t ∈ {2, 4, 6, ..., 20}. The experiments were conducted on a 3.07GHz
machine with 4 cores running a 32-bit version of Linux. A cutoff of 10 minutes of CPU time and
3 GB of memory was applied to each planner.
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The  methodology  for  the  experiments  is  as  follows:  each  planner  solves  the  same  problem, 
i.e.,  find  a  giving  object  from  a  particular  initial  location,  100  times.  Learning  is  not  allowed,  that 
is,  SSiPP  and  UCT  cannot  use  the  bounds  obtained  in  previous  solutions  of  the  same  problem 
to  improve  their  performance.  Table  8.2  presents  the  results  as  the  average  and  95%  confidence 
interval  of  number  of  actions  performed  in  each  problem;  For  ease  of  presentation,  only  the  best 
3  parametrizations  of  UCT  and  best  6  parametrizations  of  SSiPP  are  shown. 
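
As an illustration of how one cell of Table 8.2 could be computed, the following sketch summarizes the action counts of repeated runs as a mean and a 95% confidence interval. The normal approximation (half-width 1.96 s / sqrt(n)) and the sample data are assumptions made here for illustration; the text does not state which interval estimator was used.

```python
import math
import statistics


def mean_and_ci95(samples):
    """Return (mean, half_width) of a 95% confidence interval.

    Assumption: normal approximation, half_width = 1.96 * s / sqrt(n),
    where s is the sample standard deviation.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return mean, 1.96 * std / math.sqrt(n)


# Hypothetical action counts from 100 independent runs of one problem.
runs = [8, 10, 12, 10] * 25
mean, half_width = mean_and_ci95(runs)
print(f"{mean:.1f} ±{half_width:.1f}")
```

With only 100 runs per problem, a Student-t quantile instead of 1.96 would give a slightly wider interval; the difference is small at this sample size.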

Overall, SSiPP performs better than the other planners in 55 problems out of 60 (approximately
92%), while FF-Replan and UCT are the best planners in 3 and 4 problems, respectively.
Another clear trend is that the performance of SSiPP improves as $t$ increases. This is
expected since the behavior of SSiPP approaches the behavior of its underlying $\epsilon$-optimal planner,
in this case LRTDP, as $t$ increases. However, this improvement in performance is obtained
by increasing the search space and consequently the running time of SSiPP. This trade-off between
performance and computational time is shown in Figure 8.6, where the run time of the
planners is presented.

Looking at specific objects and their priors, we can categorize the objects into abundant,
uniformly distributed and rare. An example of an abundant object in the experiments is papers,
since its prior is 0.7 for office locations and offices represent 48% of the locations. Thus,
the probability of not finding papers is the lowest among all the objects considered: approximately
$9 \times 10^{-32}$. Therefore, finding objects of this category is not a hard task and optimistic
approaches, such as FF-Replan, perform well. This effect is illustrated by the results in the third
block of Table 8.2, in which the 95% confidence intervals of all planners overlap considerably.
A similar phenomenon happens with uniformly distributed objects, i.e., objects whose
prior is close to a uniform distribution, represented in the experiments by pen.

A more challenging problem is posed by rare objects, i.e., objects whose prior probability
is concentrated in very few locations. In this experiment, coffee, cup and toner can be
seen as rare objects. As expected, FF-Replan performs poorly for rare objects and extra reasoning
is necessary in order to efficiently explore the state space. For instance, consider finding
the object cup starting at position 7 (Figure 8.5). Both a kitchen and an office are 3 steps away
from position 7. In the all-outcomes determinization used by FF-Replan, the planner has
access to deterministic actions that always find cup in the office and in the kitchen; therefore,
FF-Replan will randomly break the tie between exploring the kitchen and the neighboring office
from position 7. If the office is explored, then FF-Replan will explore all the other offices in the
hallway between positions 7 and 3 because they will be the closest locations not explored yet.
Since the prior for cup is 0.12 for offices, a better policy is to explore the kitchen (prior 0.36) and
then the two bathrooms (prior 0.42), which are at distances 4 and 5 from the kitchen.
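
The cup example can be made concrete with a small calculation. The sketch below compares two exploration orders by the expected number of steps walked before the object is found and by the probability that it is still missing at the end of the route. Only the priors (0.12, 0.36, 0.42) and the initial distance of 3 steps come from the text; the remaining leg lengths, the number of offices, and the independence of findings across locations are assumptions made for illustration.

```python
def expected_steps(route):
    """Expected steps walked along `route`, stopping at the first success.

    `route` is a list of (leg_length, prior) pairs in visiting order, where
    leg_length is the number of steps from the previous location and prior is
    the probability of finding the object at the location just reached.
    Returns (expected_steps, probability_object_not_found_on_route).
    """
    expected = 0.0
    p_still_searching = 1.0
    for leg_length, prior in route:
        # A leg is walked only if every earlier location failed.
        expected += p_still_searching * leg_length
        p_still_searching *= 1.0 - prior
    return expected, p_still_searching


# Priors for cup are from the text; leg lengths after the first are hypothetical.
kitchen_first = [(3, 0.36), (4, 0.42), (1, 0.42)]  # kitchen, then two bathrooms
offices_first = [(3, 0.12)] + [(1, 0.12)] * 5       # six offices along a hallway

for name, route in [("kitchen first", kitchen_first), ("offices first", offices_first)]:
    steps, miss = expected_steps(route)
    print(f"{name}: {steps:.2f} expected steps, miss probability {miss:.2f}")
```

Under these assumed distances, the kitchen-first route walks fewer steps in expectation and leaves a lower probability of not having found the cup, which is the kind of trade-off an optimistic determinization cannot see.
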

                                                                3.9 ±1    3.6 ±1    3.4 ±1    3.6 ±1
 7  5.9 ±1    6.4 ±1    6.2 ±1    6.0 ±1    6.0 ±1    6.1 ±1    6.0 ±1    5.8 ±1    6.2 ±1    5.8 ±1
 8  4.7 ±1    3.9 ±1    3.5 ±1    3.8 ±1    4.4 ±1    3.5 ±1    3.9 ±1    3.6 ±1    3.6 ±1    3.7 ±1
 9  4.8 ±1    3.5 ±1    3.7 ±1    4.0 ±1    4.0 ±1    3.5 ±1    3.9 ±1    3.8 ±1    3.8 ±1    3.8 ±1
10  3.4 ±0    3.3 ±1    4.1 ±2    3.5 ±1    3.2 ±1    3.3 ±0    3.5 ±1    3.4 ±1    3.7 ±1    3.5 ±1

pen
 1  9.4 ±2    9.1 ±3    8.7 ±3    9.3 ±4    9.0 ±2    10.2 ±2   8.7 ±2    8.5 ±2    9.1 ±2    8.4 ±1
 2  8.8 ±2    8.9 ±4    9.0 ±2    8.7 ±3    9.8 ±2    9.2 ±2    9.8 ±2    8.5 ±1    8.9 ±2    8.9 ±2
 3  8.5 ±1    10.8 ±3   10.8 ±3   12.0 ±3   9.5 ±2    8.2 ±2    9.5 ±2    8.9 ±2    8.7 ±2    7.8 ±1
 4  8.2 ±2    9.6 ±3    10.4 ±3   9.1 ±3    9.2 ±2    8.3 ±2    9.0 ±2    8.7 ±3    9.0 ±2    8.5 ±2
 5  8.7 ±2    9.6 ±3    8.6 ±2    9.7 ±5    9.6 ±1    9.9 ±2    8.8 ±2    9.0 ±2    9.4 ±2    9.1 ±2
 6  11.1 ±3   11.0 ±3   11.7 ±2   10.8 ±3   11.0 ±2   10.7 ±1   10.6 ±2   10.0 ±2   10.1 ±2   10.0 ±2
 7  10.9 ±2   11.7 ±3   11.9 ±3   11.4 ±4   11.4 ±2   11.1 ±2   11.2 ±2   11.3 ±2   11.2 ±2   11.5 ±2
 8  10.7 ±2   10.4 ±3   10.9 ±2   10.5 ±3   10.1 ±2   11.8 ±2   8.6 ±2    10.8 ±2   10.4 ±2   10.2 ±2
 9  11.3 ±2   10.4 ±3   10.6 ±3   10.9 ±4   10.2 ±2   10.9 ±2   10.8 ±2   10.9 ±2   10.0 ±2   10.9 ±2
10  9.7 ±2    9.3 ±2    9.9 ±2    9.7 ±2    9.4 ±2    9.8 ±2    9.5 ±2    9.6 ±2    9.9 ±2    9.5 ±2

toner
 1  54.1 ±9   43.2 ±10  41.9 ±11  41.3 ±11  42.8 ±7   29.5 ±7   27.2 ±5   37.9 ±7   27.1 ±6   27.9 ±6
 2  56.8 ±9   41.9 ±10  45.7 ±12  40.3 ±11  41.5 ±5   19.0 ±5   18.3 ±5   18.7 ±5   18.5 ±6   18.3 ±6
 3  50.1 ±9   56.6 ±12  55.3 ±11  53.1 ±13  38.5 ±5   33.1 ±6   25.3 ±6   22.4 ±4   23.4 ±9   21.2 ±5
 4  61.3 ±9   59.3 ±10  58.0 ±12  42.2 ±11  30.2 ±9   20.7 ±6   20.5 ±6   19.1 ±7   21.3 ±7   19.3 ±7
 5  39.3 ±6   38.9 ±10  31.5 ±10  36.5 ±12  30.2 ±7   31.8 ±8   23.9 ±5   23.2 ±6   25.0 ±7   23.6 ±7
 6  53.3 ±6   37.5 ±11  29.8 ±7   23.1 ±6   18.6 ±6   19.6 ±4   19.0 ±5   18.9 ±6   18.4 ±4   18.6 ±6
 7  45.5 ±7   26.4 ±10  20.7 ±8   21.2 ±7   18.3 ±5   17.9 ±5   18.0 ±6   18.4 ±7   17.6 ±7   17.9 ±5
 8  33.9 ±8   21.5 ±10  19.8 ±12  18.7 ±9   23.4 ±10  19.7 ±9   18.8 ±6   16.7 ±8   16.2 ±8   17.1 ±7
 9  36.8 ±8   29.9 ±10  25.9 ±10  23.6 ±9   18.5 ±8   17.6 ±6   18.8 ±7   18.3 ±9   16.6 ±6   16.2 ±5
10  54.5 ±8   31.5 ±9   29.5 ±7   27.6 ±10  27.8 ±6   25.1 ±6   23.0 ±6   24.1 ±7   22.6 ±7   22.1 ±6

Table 8.2: Performance of different planners in the service robot experiments. Each cell represents
the average and 95% confidence interval of the number of actions applied to find the given
object starting at the given initial location (Figure 8.5). Bold font shows the best-performing planner for the
given problem, i.e., the combination of object and initial location represented by each line of
the table.

[Plot: Average Planning Time; legend entry: FF-Replan]


Figure 8.6: Average run time for the planners to find the objects papers and toner in our service
robot problem. The y-axis is in log-scale and its unit is milliseconds. Error bars are omitted for
clarity. The plots for the other objects follow a similar pattern, with SSiPP for $t = 12$ always
faster than the UCT planners for $w = 1000$.


The improvement in performance over FF-Replan is remarkable for the rare object toner,
which can be found with probability 0.87 in a single location, the printer room. For these problems,
both UCT and SSiPP perform better than FF-Replan, and the average number
of actions applied by SSiPP, for $t > 14$, is approximately half of the average number of actions
applied by FF-Replan. Moreover, for the toner problems, the best SSiPP parametrization
(i.e., $t = 20$) solves the problem using between 39.9% and 91.1% of the actions used by the best
parametrization of UCT ($w = 1000$ and $c = 8$).
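
The reported range can be checked approximately against the rounded averages in the toner block of Table 8.2. The sketch below assumes that each toner row lists ten values, that the fourth value is UCT with $w = 1000$ and $c = 8$, and that the last value is SSiPP with $t = 20$; this column assignment is an assumption, since the column headers are not reproduced here.

```python
# Rounded averages from the toner block of Table 8.2, initial locations 1..10.
# Assumption: 4th value column = UCT (w = 1000, c = 8); last column = SSiPP (t = 20).
uct_best = [41.3, 40.3, 53.1, 42.2, 36.5, 23.1, 21.2, 18.7, 23.6, 27.6]
ssipp_t20 = [27.9, 18.3, 21.2, 19.3, 23.6, 18.6, 17.9, 17.1, 16.2, 22.1]

# Fraction of UCT's actions that SSiPP needs, per initial location.
ratios = [s / u for s, u in zip(ssipp_t20, uct_best)]
print(f"min {min(ratios):.1%}, max {max(ratios):.1%}")
```

Because the table entries are rounded to one decimal place, the extremes obtained this way can differ from the reported figures in the last digit.
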
8.4 Summary

In this chapter, we presented how to solve the problem of a software or robotic agent moving in
a known environment in order to find an object using SSPs encoded in PPDDL, a standard probabilistic
planning language. We empirically compared three different replanning approaches to
solve the proposed problems: determinizations (FF-Replan), sampling (UCT) and short-sighted
planning (SSiPP). The experiments showed that the simpler and optimistic approach used by
FF-Replan suffices if the object can be found in most locations with high probability or if its prior
is nearly uniform across all locations. Alternatively, if the probability of finding the object is concentrated
in few locations, then SSiPP outperforms the other approaches and, for some parametrizations,
SSiPP executes on average fewer than half of the actions executed by FF-Replan to find the
desired object.

It is important to notice that all the planners compared in this chapter are domain-independent
planners. Due to the strong geometric constraints in robotics applications, most real-world robots
use domain-dependent planners. This class of planners takes advantage of domain-specific knowledge
to prune the search space and to employ more accurate heuristics. For these reasons, it
is unlikely that SSiPP (or any other domain-independent planner) will be able to outperform
domain-dependent planners in real-world robotics problems. Nonetheless, the concept of short-sighted
planning could be easily incorporated into domain-dependent planners to improve their
performance in probabilistic environments, such as the object-finding domain presented in this
chapter.
Conclusion


This dissertation sets out to address the question:

How to plan for probabilistic environments such that it scales up while offering formal
guarantees underlying policy generation?

This  final  chapter  summarizes  the  contributions  we  have  presented  to  answer  this  question.  We 
also  describe  some  new  directions  for  future  work  that  this  thesis  raises. 


9.1  Contributions 

The  contributions  of  this  thesis  can  be  grouped  into  four  classes: 

1.  Short-Sighted  Models 

We  introduced  the  concept  of  short-sighted  probabilistic  planning  problems,  a  special  case 
of  probabilistic  planning  problems  in  which  the  state  space  is  pruned  and  actions  are  not 
simplified.  Three  short-sighted  models,  based  on  different  criteria  to  prune  the  state  space, 
were  presented:  depth-based  short-sighted  problems,  in  which  all  the  states  are  reachable 
using  no  more  than  a  given  number  of  actions;  trajectory-based  short-sighted  problems, 
in which all states are reachable with probability greater than or equal to a given threshold;
and  greedy  short-sighted  problems,  in  which  the  states  have  the  best  trade-off  between 
probability  of  being  reached  and  expected  cost  to  reach  the  goal  from  them. 

2.  Short-Sighted  Probabilistic  Planners 

We introduced the Short-Sighted Probabilistic Planner (SSiPP) algorithm, which solves probabilistic
planning problems by iteratively generating and solving short-sighted subproblems.
We also presented three extensions of SSiPP: Labeled-SSiPP, which improves the
convergence of SSiPP to the $\epsilon$-optimal solution; Parallel Labeled-SSiPP, which solves multiple
short-sighted problems in parallel to speed up the search for the $\epsilon$-optimal solution;
and SSiPP-FF, which improves the efficiency of SSiPP when a suboptimal solution is acceptable.

3.  Theoretical  Analysis 

We proved that the optimal solutions of short-sighted subproblems are lower bounds for
the original probabilistic planning problem associated with them. Moreover, we showed
that solutions for depth-based short-sighted subproblems can be executed for at least $t$
steps, where $t$ is a parameter, in the original problem without replanning. We proved that
SSiPP, Labeled-SSiPP and Parallel Labeled-SSiPP are asymptotically optimal and derived
an upper bound on the number of iterations necessary for Labeled-SSiPP and Parallel
Labeled-SSiPP to converge to the $\epsilon$-optimal solution.

4.  Empirical  Evaluation 

We provided a rich empirical evaluation of the proposed algorithms for two different tasks:
(i) to find $\epsilon$-optimal solutions, and (ii) to compute a solution under the International
Probabilistic Planning Competition [Younes et al., 2005, Bonet and Givan, 2007, Bryce
and Buffet, 2008] rules. Several domains were used in our empirical evaluation, including
domains proposed in this thesis and benchmarks from the probabilistic planning community.
We also empirically showed how a mobile service robot moving in a building in order
to find an object can use short-sighted planning to improve its performance.
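
To make the depth-based model from item 1 concrete, the following sketch collects the states reachable from a state using at most $t$ actions and marks the non-goal states at depth exactly $t$ as artificial goals. It is a minimal illustration and not the thesis's implementation: the function names, the toy chain domain, and the choice not to expand original goal states are assumptions.

```python
from collections import deque


def depth_based_states(s0, t, successors, is_goal):
    """Return (states, artificial_goals) for a depth-t problem rooted at s0.

    successors(s) yields every state reachable from s by one action with
    positive probability; is_goal(s) tests the original goal set.
    """
    depth = {s0: 0}
    queue = deque([s0])
    while queue:
        s = queue.popleft()
        if depth[s] == t or is_goal(s):
            continue  # do not expand past the horizon or past original goals
        for nxt in successors(s):
            if nxt not in depth:
                depth[nxt] = depth[s] + 1  # breadth-first order gives minimum depth
                queue.append(nxt)
    artificial = {s for s, d in depth.items() if d == t and not is_goal(s)}
    return set(depth), artificial


# Hypothetical chain domain: from state i, an action either stays at i or moves to i + 1.
def chain_successors(s):
    return {s, min(s + 1, 5)}


def chain_is_goal(s):
    return s == 5


states, artificial = depth_based_states(0, 2, chain_successors, chain_is_goal)
print(sorted(states), sorted(artificial))
```

A planner in the spirit of SSiPP would solve the problem restricted to `states`, charging a heuristic estimate when an artificial goal is reached, and then repeat from the state where execution stopped.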

9.2  Directions  for  Future  Work 

This  thesis  opens  up  new  interesting  directions  for  further  research  in  probabilistic  planning. 
Moreover,  short-sighted  planning  is  a  general  concept  that  can  be  applied  to  any  planning  under 
uncertainty  model.  Next,  we  enumerate  a  number  of  directions  for  future  work. 

9.2.1  Automatically  Choosing  a  Short-Sighted  Model  and  its  Parameters 

Short-sighted SSPs can exploit the underlying structure of the problem through their different
simplifications of the state space and parameters, e.g., the parameter t for depth-based short-sighted
SSPs and p for trajectory-based short-sighted SSPs. Our experiments show that the
performance of SSiPP and its extensions can be further improved by optimizing the choice of
short-sighted model used and its parameters for each domain.

A future direction is to derive (heuristic) methods that automatically choose or adapt the
short-sighted model and its parameters for the current SSP being solved. One approach to tackle
this problem is to model it as a multi-armed bandit problem in which the combinations of short-sighted
models and their parameters are the arms.
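
As a rough sketch of the bandit formulation, the code below uses the UCB1 rule to decide which (model, parameter) combination to try next. UCB1 is one standard bandit algorithm chosen here for illustration; the arms, the simulated reward function, and the assumption that each evaluation yields a reward in [0, 1] (for example, a normalized negative cost) are all hypothetical.

```python
import math
import random


def ucb1_select(counts, means):
    """Pick an arm with the UCB1 rule (rewards assumed to lie in [0, 1])."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # play every arm once before using the bound
    total = sum(counts)
    scores = [m + math.sqrt(2.0 * math.log(total) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def choose_configuration(arms, evaluate, rounds):
    """Run UCB1 for `rounds` evaluations; return pull counts and mean rewards."""
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for _ in range(rounds):
        i = ucb1_select(counts, means)
        reward = evaluate(arms[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental average
    return counts, means


# Hypothetical arms: (short-sighted model, parameter value).
ARMS = [("depth", 2), ("depth", 8), ("trajectory", 0.25)]
# Hypothetical stand-in for "run the planner and score the result in [0, 1]".
BASE_REWARD = {ARMS[0]: 0.2, ARMS[1]: 0.8, ARMS[2]: 0.3}
rng = random.Random(0)


def simulated_reward(arm):
    return BASE_REWARD[arm] + rng.uniform(-0.1, 0.1)


counts, means = choose_configuration(ARMS, simulated_reward, rounds=2000)
best = ARMS[counts.index(max(counts))]
print(best, counts)
```

In an actual planner, `simulated_reward` would be replaced by one planning-and-execution episode with the chosen configuration, so the cost of exploration is paid in real planning time.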

A different approach is to perform automatic domain analysis. This technique has been
successfully applied to automatically elicit knowledge implicitly embedded in the domain, e.g.,
generation of state constraints [Gerevini and Schubert, 1998, Gerevini and Schubert, 2000, Hoffmann,
2011] and removal of irrelevant facts and actions [Nebel et al., 1997, Haslum and Jonsson,
2000, Haslum, 2007]. It would be interesting to explore what features can be extracted from
preprocessing the domain that can guide, or constrain, the choice of short-sighted model and its
parameters.

9.2.2  Transfer  Learning  using  Short-Sighted  Problems 

Transfer learning for probabilistic planning can be seen as the problem of solving an SSP $\mathbb{S}$ by
reusing policies for similar SSPs. Formally, let $\pi_{\mathbb{S}}$ be an optimal policy for $\mathbb{S}$ and define a new
SSP $\mathbb{S}'$ in which only the set of goal states G differs between $\mathbb{S}$ and $\mathbb{S}'$. In this case, $\mathbb{S}'$ has
a different optimal value function $V^*_{\mathbb{S}'}$ that, most likely, yields optimal policies $\pi_{\mathbb{S}'}$ different
from $\pi_{\mathbb{S}}$. Transfer learning aims to use $\pi_{\mathbb{S}}$ to guide the learning of $\pi_{\mathbb{S}'}$ and thus speed up the
search for $\pi_{\mathbb{S}'}$ [Fernandez and Veloso, 2006].

Although $\pi_{\mathbb{S}}$ and $\pi_{\mathbb{S}'}$ can be different, the policies for some of the short-sighted SSPs used
during the solution of both $\mathbb{S}$ and $\mathbb{S}'$ might still be the same. This is potentially interesting for
problems that share states that must always be visited in both problems in order to compute their optimal
solutions, e.g., the intermediary doors in the hallway problems (Example 5.1 on page 53).

It would be interesting to explore how the solutions of different short-sighted SSPs are affected
when the goal of the original SSP is changed. Another step in this direction is the analysis
of  the  necessary  conditions  of  SSPs  in  order  to  be  able  to  efficiently  reuse  the  optimal  policies  of 
their  associated  short-sighted  SSPs. 

9.2.3  Short-Sighted  Planning  for  Imprecise  Probabilistic  Problems 

In  many  real-world  problems,  it  is  not  possible  to  obtain  a  precise  representation  of  the  transition 
probabilities  in  order  to  use  probabilistic  planning  models.  This  may  occur  for  many  reasons, 
including imprecise or conflicting elicitations from experts, insufficient data from which to estimate
precise transition models, or non-stationary transition probabilities due to insufficient state
information.

Several models have been proposed [Satia and Lave Jr, 1973, Givan et al., 2000, Trevizan et al.,
2007, Delgado et al., 2011] to handle this uncertainty in the transition probabilities; their
drawback is the increased computational complexity of finding an optimal policy. Notice that the
previously proposed problem relaxations for probabilistic planning do not obtain robust solutions
for imprecise probabilistic problems. For instance, solutions obtained using determinizations
completely ignore the extra information regarding the imprecise probabilities of actions and relax
them to deterministic actions.

Alternatively, the extension of short-sighted planning to imprecise probabilistic planning
problems has the potential to efficiently compute robust solutions since the structure of actions is
not simplified. Therefore, short-sighted models for imprecise probabilistic problems would be
able to represent both loops in the states and, more importantly, the imprecision in the action representation,
e.g., a probability interval for each effect. In order to extend short-sighted planning
to imprecise probabilistic problems, two steps are necessary: to define short-sighted imprecise
models, and to extend SSiPP to handle imprecise probabilistic problems.
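
To illustrate what a probability interval for each effect implies for a robust solution, the sketch below computes the worst-case expected cost of one action when each outcome probability is only known to lie in an interval. The greedy rule (start from the lower bounds and give the remaining mass to the costliest outcomes first) is a standard way to solve this small linear program; it is shown as an assumption-laden illustration and is not taken from the thesis.

```python
def worst_case_expectation(values, lower, upper):
    """Maximize sum_i p_i * values[i] subject to lower[i] <= p_i <= upper[i]
    and sum_i p_i == 1. Returns (expectation, probabilities)."""
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("intervals admit no probability distribution")
    probs = list(lower)
    slack = 1.0 - sum(lower)
    # Give the remaining probability mass to the costliest outcomes first.
    for i in sorted(range(len(values)), key=lambda j: values[j], reverse=True):
        extra = min(upper[i] - lower[i], slack)
        probs[i] += extra
        slack -= extra
    return sum(p * v for p, v in zip(probs, values)), probs


# Hypothetical action with two effects: cost-to-go 10 or 2 after the action.
cost_to_go = [10.0, 2.0]
lower = [0.1, 0.6]
upper = [0.4, 0.9]
expectation, probs = worst_case_expectation(cost_to_go, lower, upper)
print(expectation, probs)
```

A robust Bellman backup would apply this inner maximization to every action and then take the minimum over actions, which is where the extra computational cost of imprecise models comes from.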

9.2.4  Short-Sighted  Decentralized  SSPs  with  Sparse  Interactions 

One  assumption  of  SSPs  (and  MDPs)  is  that  there  is  only  one  single  agent  executing  actions  and 
thus  modifying  the  environment.  If  more  than  one  agent  is  modifying  the  environment,  i.e.,  a 
multi-agent  problem,  then  SSPs  need  to  be  generalized  to  encompass  the  interaction  between 
agents.  One  possible  approach  to  model  such  problems  is  to  assume  joint-observability,  i.e., 
each  agent  is  aware  of  the  state  and  actions  performed  by  all  other  agents,  which  seldom  holds 
in  practice.  If  joint-observability  is  completely  ignored,  then  finding  the  optimal  policy  for  even 
the  case  where  agents  share  the  same  cost  function  is  undecidable  [Bernstein  et  al.,  2002]. 

In  practice,  joint-observability  is  only  required  in  specific  parts  of  the  environment,  i.e.,  the 
interaction  between  agents  is  sparse  [Melo  and  Veloso,  2009,  Melo  and  Veloso,  2011].  One 
example  of  sparse  interactions  is  two  or  more  service  robots  navigating  in  a  building  (Figure  9.1). 
These  robots  coordinate  their  actions  during  navigation  only  when  they  need  to  pass  through  the 
same  doors  or  a  narrow  hallway.  More  generally,  coordination  is  required  between  agents  only 
in regions of the state space in which: (i) there is a conflict over a resource; or (ii) direct interaction
is  needed  in  order  to  achieve  a  goal. 

One novel approach to solve sparse interaction problems would be to use short-sighted probabilistic
planning. The benefit of using short-sighted models for this class of problems is that the
local interactions can be perfectly modeled while future and unlikely interactions can be approximated.
Besides extending SSiPP in order to handle multi-agent interactions, this novel approach
also requires the proposal of new short-sighted models to remove and heuristically approximate
unlikely interactions between agents.

Figure 9.1: Example of a sparse-interaction multi-agent planning problem. Two robots, $R_1$ and
$R_2$, have to navigate in the depicted map to reach their goal locations, $G_1$ and $G_2$ respectively.
Coordination between $R_1$ and $R_2$ is only needed if and when both try to cross the narrow hallway
at the same time. Figure adapted from [Melo and Veloso, 2009].

9.2.5  Short-Sighted  Partially  Observable  Probabilistic  Problems 

Partially  Observable  MDPs  (POMDPs)  generalize  MDPs  (Section  2.1)  by  modeling  agents  that 
have  incomplete  state  information  [Sondik,  1971].  A  common  approach  to  solve  POMDPs  is 
to  convert  them  to  belief  MDPs,  i.e.,  an  MDP  in  the  belief  space,  and  RTDP  (Section  2.3.1) 
can  be  applied  to  solve  the  obtained  belief  MDPs  [Bonet  and  Geffner,  2009].  This  adaptation 
of  RTDP,  RTDP-Bel,  handles  the  continuous  state  space  of  the  belief  MDPs  by  using  function 
approximations  [Bertsekas  and  Tsitsiklis,  1996],  specifically  by  discretizing  the  belief  space  into 
a  finite  grid. 
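
The conversion to a belief MDP rests on the Bayes update of the belief after each action and observation, which can be sketched as follows. The function signatures, the two-room example, and the rounding scheme used to map a belief onto a finite grid are assumptions for illustration; the rounding shown is not necessarily the discretization used by RTDP-Bel.

```python
def belief_update(belief, action, observation, transition, observe):
    """Bayes update: b'(s') is proportional to
    O(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    new_belief = {}
    for s, b in belief.items():
        for s_next, p in transition(s, action).items():
            new_belief[s_next] = new_belief.get(s_next, 0.0) + p * b
    for s_next in new_belief:
        new_belief[s_next] *= observe(observation, s_next, action)
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / norm for s, p in new_belief.items()}


def discretize(belief, resolution):
    """Map a belief to a hashable grid key by rounding to multiples of 1/resolution."""
    return tuple(sorted((s, round(p * resolution)) for s, p in belief.items()))


# Hypothetical example: an object is in room A or room B and never moves.
def transition(s, action):
    return {s: 1.0}


def observe(observation, s_next, action):
    # The sensor reports the correct room with probability 0.8.
    return 0.8 if observation == "seen_" + s_next else 0.2


posterior = belief_update({"A": 0.5, "B": 0.5}, "look", "seen_A", transition, observe)
print(posterior, discretize(posterior, 10))
```

Two beliefs that round to the same grid key share one value estimate, which is what makes the belief space finite at the price of the convergence guarantee.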

The main drawback of the representation using function approximations is that convergence
is no longer guaranteed. However, in practice, the performance of RTDP-Bel is comparable with
state-of-the-art POMDP solvers and it outperforms them in domains such as RockSample and LifeSurvey
[Bonet and Geffner, 2009]. An interesting future direction is to use RTDP-Bel as the optimal
solver for SSiPP and apply the short-sighted models proposed in this thesis to model subproblems
of belief MDPs. New definitions of short-sighted models that are specific to belief MDPs
might be necessary in order to make this approach feasible.


9.3  Summary 

This thesis contributes a number of techniques to effectively solve probabilistic planning problems.
The cornerstone of the presented algorithms is the concept of short-sighted problems, a
novel approach to relax probabilistic planning problems. We proved the relationship between solutions
of short-sighted subproblems and the original probabilistic planning problem associated
with them, as well as the main properties of our algorithms, e.g., optimality. We demonstrated
the effectiveness of the presented algorithms and different short-sighted models in a rich empirical
comparison against state-of-the-art probabilistic planners in several domains.


Appendix  A 


Proof  of  Lemmas  3.1  and  3.2 


Proof of Lemma 3.1. If $s \in S_{s,t} \cap G$, then $(B^k_{s,t}V)(s) = (B^kV)(s) = 0$ for all $k \in \mathbb{N}^*$ by the
definitions of $B$ and $B_{s,t}$. Otherwise, $s \in S_{s,t} \setminus G_{s,t}$, therefore $1 \le k \le t$. We prove this case by
induction on $k$:

• If $k = 1$, then by the definition of short-sighted SSPs (Definition 3.2), we can replace $C_{s,t}$
by $C$ in $(B_{s,t}V)(s)$ as follows:

\begin{align*}
(B_{s,t}V)(s) &= \min_a \Bigg[ \sum_{s' \in S_{s,t} \setminus G_a} P(s'|s,a) \big[ C(s,a,s') + V(s') \big] + \sum_{s' \in G_a} P(s'|s,a)\, C_{s,t}(s,a,s') \Bigg] \\
&= \min_a \Bigg[ \sum_{s' \in S_{s,t} \setminus G_a} P(s'|s,a) \big[ C(s,a,s') + V(s') \big] + \sum_{s' \in G_a} P(s'|s,a) \big[ C(s,a,s') + V(s') \big] \Bigg] \\
&= \min_a \sum_{s' \in S_{s,t}} P(s'|s,a) \big[ C(s,a,s') + V(s') \big].
\end{align*}

Since $\min_{s_a \in G_a} \delta(s, s_a) \ge 1$, then $\{s' \in S \mid P(s'|s,a) > 0, \forall a \in A\} \subseteq S_{s,t}$ and the
previous sum over $S_{s,t}$ equals the same sum over $S$. Therefore $(B_{s,t}V)(s) = (BV)(s)$.

• Assume, as induction step, that this Lemma holds for $k \in \{1, \ldots, c\}$ where $c < t$. For
$k = c + 1$, since $\min_{s_a \in G_a} \delta(s, s_a) \ge c + 1 > 1$, then $\{s' \in G_a \mid P(s'|s,a) > 0, \forall a \in A\} = \emptyset$.
Thus,

\begin{align*}
(B_{s,t}(B^cV))(s) &= \min_a \sum_{s' \in S_{s,t}} P(s'|s,a) \big[ C_{s,t}(s,a,s') + (B^cV)(s') \big] \\
&= \min_a \sum_{s' \in S_{s,t}} P(s'|s,a) \big[ C(s,a,s') + (B^cV)(s') \big].
\end{align*}

Since $c + 1 \le t$ and $s \in S_{s,t} \setminus G_{s,t}$, then $\{s' \in S \mid P(s'|s,a) > 0, \forall a \in A\} \subseteq S_{s,t}$ and we
can expand the previous sum from $s' \in S_{s,t}$ to $s' \in S$, i.e.,

\[
\sum_{s' \in S_{s,t}} P(s'|s,a) \big[ C(s,a,s') + (B^cV)(s') \big] = \sum_{s' \in S} P(s'|s,a) \big[ C(s,a,s') + (B^cV)(s') \big].
\]

Therefore $(B^{c+1}_{s,t}V)(s) = (B_{s,t}(B^cV))(s) = (B^{c+1}V)(s)$.

□


Proof of Lemma 3.2. By the definitions of $B$ and $B_{s,t}$, we have the following trivial cases: (i) if
$s \in S_{s,t} \cap G$, then $(B^k_{s,t}V)(s) = (B^kV)(s) = 0$; and (ii) if $s \in G_a$, then $(B^k_{s,t}V)(s) = 0 \le (B^kV)(s)$.
Thus, for the rest of this proof, we consider that $s \in S_{s,t} \setminus G_{s,t}$.

Let $m$ denote $\min_{s_a \in G_a} \delta(s, s_a)$. If $m \ge k$, then $(B^k_{s,t}V)(s) = (B^kV)(s)$ by Lemma 3.1. We
prove the other case, i.e., $m < k$, by induction on $i = k - m$:

• If $i = 1$, then $(B^k_{s,t}V)(s) = (B_{s,t}(B^{k-1}_{s,t}V))(s) = (B_{s,t}(B^m_{s,t}V))(s)$ thus, by Lemma 3.1,

\begin{align*}
(B^k_{s,t}V)(s) &= (B_{s,t}(B^mV))(s) \\
&= \min_a \Bigg[ \sum_{s' \in S_{s,t} \setminus G_a} P(s'|s,a) \big[ C(s,a,s') + (B^mV)(s') \big] + \sum_{s' \in G_a} P(s'|s,a) \big[ C(s,a,s') + V(s') \big] \Bigg] \\
&\le \min_a \sum_{s' \in S_{s,t}} P(s'|s,a) \big[ C(s,a,s') + (B^mV)(s') \big],
\end{align*}

where the last derivation is valid because $V$ is monotonic by assumption. Since $s \in S_{s,t} \setminus G_{s,t}$,
then $\{s' \in S \mid P(s'|s,a) > 0, \forall a \in A\} \subseteq S_{s,t}$ and we can expand the last sum over $S$.
Therefore, $(B^k_{s,t}V)(s) = (B_{s,t}(B^mV))(s) \le (B^kV)(s)$.

• Assume, as induction step, that it holds for $i \in \{1, \ldots, c\}$. Then, for $i = c + 1$, i.e.,
$k = m + c + 1$, we have that

\begin{align*}
(B^k_{s,t}V)(s) &= (B_{s,t}(B^{m+c}_{s,t}V))(s) \\
&= \min_a \Bigg[ \sum_{s' \in S_{s,t} \setminus G_a} P(s'|s,a) \big[ C(s,a,s') + (B^{m+c}_{s,t}V)(s') \big] + \sum_{s' \in G_a} P(s'|s,a) \big[ C(s,a,s') + V(s') \big] \Bigg].
\end{align*}

Since $V$ is monotonic, we have that $V(s') \le (B^{m+c}V)(s')$ for all $s' \in S$. Also, by the
induction assumption, $(B^{m+c}_{s,t}V)(s') \le (B^{m+c}V)(s')$. Thus,

\begin{align*}
(B^k_{s,t}V)(s) &\le \min_a \sum_{s' \in S_{s,t}} P(s'|s,a) \big[ C(s,a,s') + (B^{m+c}V)(s') \big] \\
&= \min_a \sum_{s' \in S} P(s'|s,a) \big[ C(s,a,s') + (B^{m+c}V)(s') \big],
\end{align*}

because $\{s' \in S \mid P(s'|s,a) > 0, \forall a \in A\} \subseteq S_{s,t}$. Therefore $(B^k_{s,t}V)(s) \le (B^kV)(s)$.

□